Abstract

Recently, infrared human action recognition has attracted increasing attention because it offers advantages over visible light, namely robustness to illumination changes and shadows. However, infrared action data remain limited, which degrades the performance of infrared action recognition. Motivated by the idea of transfer learning, an infrared human action recognition framework using auxiliary data from visible light is proposed to address the problem of limited infrared action data. In the proposed framework, we first construct a novel Cross-Dataset Feature Alignment and Generalization (CDFAG) framework that maps the infrared data and visible light data into a common feature space, where Kernel Manifold Alignment (KEMA) and a dual aligned-to-generalized encoders (AGE) model are employed to represent the features. Then, a support vector machine (SVM) is trained on both the infrared data and visible light data and used to classify the features derived from infrared data. The proposed method is evaluated on InfAR, a publicly available infrared human action dataset. To build up auxiliary data, we set up a novel visible light action dataset, XD145. Experimental results show that the proposed method achieves state-of-the-art performance compared with several transfer learning and domain adaptation methods.

1. Introduction

Human action recognition aims to recognize an ongoing action from a video clip and has received great attention in recent years due to its wide applications, including video surveillance [1], video labeling [2], video content retrieval [3], human-computer interaction [4], and sports video analysis [5]. Over the past decades, significant progress has been made in action recognition [6], and most of the state-of-the-art approaches have been developed for visible light videos [7–9]. In addition, many visible light action datasets have been constructed for action recognition, such as KTH [10], HMDB51 [11], and UCF101 [12]. Generally speaking, human action recognition in visible light has been well addressed and successfully applied in practice. However, illumination changes, shadows, background clutter, and occlusion remain great challenges for visible light action recognition [13].

With the development of sensor technology, human actions can be captured by thermal infrared cameras instead of visible light ones. Compared with visible light action recognition, infrared action recognition can address the aforementioned challenges [14]. For example, infrared thermal imaging is robust to illumination changes because it can capture humans well under poor lighting conditions, where a person can hardly be seen in visible light videos; this is very useful for night surveillance or human-computer interaction (HCI) in dim environments. In addition, since the temperatures of shadows, background clutter, and occluding objects are relatively low compared with those of humans or moving objects in infrared videos, these challenges can be well suppressed in the infrared modality. With these properties, infrared action recognition can be adopted in more applications and outperform its visible light counterpart. Therefore, infrared action recognition may become a hot topic in computer vision in the future.

Actually, infrared and visible action data lie in different feature spaces, and the traditional approaches for visible light action recognition cannot be directly applied to infrared action recognition due to the modality gap between them. Moreover, methods for infrared action recognition are limited, and there is only one publicly available infrared dataset, InfAR [20], for action recognition until now. As a result, the performance of infrared action recognition in previous works is preliminary and leaves reasonable space for further improvement. To address these issues, if a large number of previously annotated videos from various visible light video datasets could be transferred to the infrared domain for recognition, a considerable amount of time-consuming human effort, such as collecting and hand-labeling large numbers of infrared action videos, could be saved. In addition, as infrared and visible light videos may contain complementary information, infrared action recognition performance can be improved if the knowledge from visible light and infrared data is properly integrated. Nevertheless, there are at least two obstacles to integrating these two kinds of data. First, infrared and visible light videos are captured by different sensors; the strong modality gap between them will degrade recognition performance without an effective transferable feature representation. Second, in real-world scenarios, infrared videos are usually limited while visible light ones are abundant; the imbalanced data distribution will also degrade classification performance.

Tackling these problems, a novel Cross-Dataset Feature Alignment and Generalization (CDFAG) framework is proposed for the infrared action recognition task in this paper. To be more specific, we focus on adapting, aligning, and generalizing representations from different domains into a single common feature space, so that the original target domain (infrared action data) and the auxiliary source domain (visible light action data) are brought into the same feature space. We then learn a unique classifier in that semantically meaningful, aligned, and generalized feature space across datasets. In this way, the modality gap between the two datasets is reduced. To make better use of the data in the generalized feature space, we adopt a semisupervised technique so that both labeled and unlabeled data are considered in our method. In more detail, Kernel Manifold Alignment (KEMA) [15] is adapted to cross-dataset action recognition to generate aligned representations, and cross-domain generalized features are then learnt by training two novel aligned-to-generalized encoders (AGE) on the source and target datasets in parallel. To build up source domain data, we set up a new visible light action dataset called XD145. Putting everything together, the main contributions of this paper can be summarized as follows:
(i) We propose a novel Cross-Dataset Feature Alignment and Generalization (CDFAG) framework to address the visible-to-infrared action recognition problem. It can efficiently reduce the modality gap across datasets and generate aligned and generalized feature representations in a common space with low intraclass diversity and high interclass variance.
(ii) To the best of our knowledge, this is the first time Kernel Manifold Alignment (KEMA) has been applied to infrared action recognition to generate aligned representations in a common latent space.
(iii) We design a novel aligned-to-generalized encoder (AGE) model to learn generalized feature representations after feature alignment by KEMA.
(iv) We achieve state-of-the-art results in visible-to-infrared action recognition compared with several transfer learning, domain adaptation, and deep learning based methods.
(v) Since only a limited number of action videos from existing benchmark visible light datasets share the same class labels as the InfAR dataset, we construct a new visible light action dataset called XD145 to build up the auxiliary source domain data. This dataset can further be used as a benchmark visible light action dataset.

The rest of the paper is organized as follows. In Section 2, we review some background and related works. In Section 3, we explain details of our proposed method. Section 4 presents the experimental results of our proposed method on visible-to-infrared action recognition and cross-dataset action recognition, and finally Section 5 draws the conclusion and future research lines.

2. Background and Related Works

In this section, we present the background and related works. We briefly review the concepts and methods in transfer learning and domain adaptation, their benefits in cross-domain action recognition, and the development status of infrared action recognition.

2.1. Transfer Learning and Domain Adaptation

Classical pattern recognition and machine learning tasks [21–23] mainly rely on a robust classifier learnt from annotated training data and assume that the testing data and the training data belong to the same feature space or distribution. However, this is unrealistic in real-world applications because of the high cost of manually labeling training samples and environmental restrictions. Therefore, sufficient training data that share the same feature space or distribution with the testing data cannot always be guaranteed, even when using feature selection methods [24, 25] that do not consider the distribution gap. In this case, the potential discriminability of the trained model can be limited by insufficient training data. Of the several schools of thought addressing this problem, two prominent ones are transfer learning [26] and domain adaptation [27]. In fact, transfer learning methods are closely related but not equivalent to domain adaptation: transfer learning aims to transfer knowledge from a source domain to a target domain, while domain adaptation methods essentially solve transfer learning problems. Surveys such as [26] show that the type of knowledge being transferred can be roughly classified into four categories: instance transfer, feature-representation transfer, parameter transfer, and relational-knowledge transfer. Our proposed method falls into feature-representation transfer, as it adapts the representations from different domains to a single common latent space. The literature on feature-representation transfer can be roughly divided into supervised, unsupervised, and semisupervised adaptation problems, depending on the availability of labels in the different domains.

Semisupervised domain adaptation has attracted much attention in recent years. For example, a reconstruction-based domain adaptation method called latent sparse domain transfer (LSDT) was proposed in [28] for visual categorization of heterogeneous data via subspace learning and sparse representation. A norm-regularized discriminative robust kernel transfer learning (DKTL) method was proposed in [29] to address the distribution mismatch problem of image classification across domains. Although the methods in [28, 29] achieve good performance in cross-domain image classification, to the best of our knowledge, the more challenging problem of visible-to-infrared cross-dataset action recognition has not been studied. Extreme learning machines were used in [30] to address the visual knowledge adaptation problem for video event recognition and object recognition. It should be noted that the proposed method is based on feature-representation domain adaptation, which is essentially different from extreme learning machine methods, which are classifier-based domain adaptation approaches. Apart from the methods mentioned above, manifold alignment is an important kind of semisupervised domain adaptation, which concurrently matches corresponding samples and preserves the geometry of each domain via graph Laplacians [31]. In essence, manifold alignment boils down to finding projections to a common latent space. The semisupervised Kernel Manifold Alignment (KEMA) method was proposed in [15] and has been successfully applied to multimodal visual object recognition [15], multisubject facial expression recognition [15], and multitemporal remote sensing image classification [32]. Nevertheless, KEMA has not yet been applied to cross-domain human action recognition. Therefore, we study the effectiveness of KEMA and adapt it to visible-to-infrared action recognition to obtain aligned feature representations across datasets. The main focus of our work is on the use of KEMA for feature alignment in visible-to-infrared action recognition, which was not addressed in [15].

2.2. Cross-Domain Action Recognition via Transfer Learning

With the development of action recognition, applying transfer learning to action recognition datasets generated by different sensors, such as visible light cameras, infrared cameras, RGB-D cameras, wearable sensors, or other sensor modalities, has received great interest in recent years [33]. As video sequences are the most common type of action data, differences in how cameras are manipulated and deployed to capture actions lead to various issues, for example, different camera viewing angles, cluttered backgrounds, illumination changes, and different light spectra such as the visible and infrared spectrum, all of which contribute to significant variance in the captured videos. Therefore, action recognition, especially cross-domain action recognition, is a challenging problem.

Transferring knowledge for cross-view action recognition is prevailing [34]. For example, Zheng et al. [35] proposed learning a pair of dictionaries simultaneously from video pairs taken at different views to encourage each video pair to have the same sparse representation. Zhang et al. [36] proposed a linear transformation to transform the source view to the target view via a virtual path. Wu et al. [37] proposed a method to discover a discriminative common representation space where source and target views are linked and knowledge is transferred between them. Sui et al. [38] introduced two different projection matrices to map the action data from two different views into a common space with low intraclass diversity and high interclass variance while reducing the mismatch between them. Zu and Zhang [39] introduced a method called Canonical Sparse Cross-view Correlation Analysis to address the multiview feature extraction problem. Different from the above-mentioned cross-view action recognition problems, we tackle the visible-to-infrared action recognition problem, which is essentially a cross-dataset action recognition problem.

In recent works on cross-dataset action recognition, Bian et al. [40] proposed a transfer topic model (TTM) that utilizes information from the auxiliary domain to assist recognition tasks in the target domain. Zhu and Shao [18] introduced a weakly supervised cross-domain dictionary learning (WSCDDL) approach that learns a reconstructive, discriminative, and domain-adaptive dictionary pair and the corresponding classifier parameters to address cross-domain image classification, action recognition, and event recognition problems. Tang et al. [41] improved the accuracy of action recognition in RGB videos by borrowing visual knowledge across different video modalities such as RGB videos, depth maps, and skeleton data. Liu et al. [42] proposed a simple-to-complex action transfer learning model (SCA-TLM) to leverage abundant labeled simple actions to improve the performance of complex action recognition. Although these works achieve promising results in their related fields, visible-to-infrared action recognition has not attracted much attention until now.

With the revival of neural networks in recent years, many neural network based transfer learning methods have been proposed as well. Kan et al. [43] proposed a bishifting autoencoder network (BAE) to alleviate the discrepancy between source and target domains and evaluated its effectiveness on face recognition. Xu et al. [19] tackled the cross-dataset action recognition problem by training a pair of many-to-one encoders in parallel to map raw features from the source and target datasets to the same space. Although the dual many-to-one encoder in [19] can generalize features well across datasets, it requires a large number of labeled training samples from both the source and target datasets to learn domain-invariant features, without utilizing auxiliary domain data as an aid. In addition, the inputs of the encoders are raw action features (action bank features), without any feature alignment performed in advance. Different from the above-mentioned neural network based transfer learning methods, we take both feature alignment and auxiliary domain data into consideration and propose a novel aligned-to-generalized encoder (AGE) model to map the aligned feature representations to the same generalized feature space with low intraclass diversity and high interclass variance.

2.3. Infrared Action Recognition

Basically, most current research efforts in action recognition have been devoted to visible light videos, while infrared action recognition has not attracted much attention. Recently, increasing efforts have been devoted to infrared action recognition. For example, Han and Bhanu [44] proposed an efficient spatiotemporal representation for human repetitive action recognition in thermal infrared scenarios. Han and Bhanu [14] introduced a hierarchical scheme to combine color and thermal images to improve human silhouette detection. Eum et al. [45] used HOG features and a support vector machine to perform infrared action recognition at night. However, these works focus on simple actions in relatively simple environments with limited infrared data. In addition, there was no publicly available infrared action recognition dataset until Gao et al. [13] built the first one, called InfAR. In [13], state-of-the-art action recognition pipelines, including widely used low-level local descriptors, were evaluated on the InfAR dataset. Gao et al. [20] then extended their previous work [13] and utilized several state-of-the-art pipelines based on low-level features and deep convolutional neural networks to evaluate their new infrared action recognition dataset (InfAR). However, the best recognition accuracy in [20] is relatively low and leaves reasonable space to further improve performance on the InfAR dataset.

Actually, transfer learning has seldom been applied to infrared action recognition. Zhu and Guo [46] proposed applying the adaptive support vector machine (A-SVM) [47] to adapt an existing visible light action classifier to classify infrared actions and achieved preliminary results on their own dataset. Although the A-SVM based method in [46] performs better than direct matching, A-SVM is essentially a classifier-transfer based method that does not consider the max-margin property for the adapted classifier on target instances; therefore, it suffers accuracy degradation as a result of overfitting.

Our proposed method differs from the above-mentioned approaches in that it projects and aligns data from different domains in a nonlinear way through kernelization to generate aligned representations, and it learns dual aligned-to-generalized encoders to obtain cross-domain generalized features while considering both discriminability and domain adaptability at the same time. In our proposed CDFAG, the classifier learned across the source and target domains becomes more discriminative against the modality gap because of the integration of both source and target domain knowledge, whereas a majority of previous transfer learning methods focus on an incomplete target domain without utilizing other domain data as an aid to improve the performance of the original recognition system. We detail our proposed method in Section 3.

3. The Proposed Method

In this section, we detail our proposed CDFAG. An overview of the CDFAG is presented in Figure 1. Actually, the proposed CDFAG is semisupervised as both the labeled and the unlabeled data are used in source and target training sets. Our proposed CDFAG consists of three stages. In the first stage, feature extraction and encoding are accomplished on both the source and target datasets, where improved dense trajectories (iDTs) features are extracted, encoded, and reduced to a low-dimensional subspace. In the second stage, aligned features of source and target domains are generated by Kernel Manifold Alignment, then a pair of aligned-to-generalized encoders are trained on the source and target datasets in parallel guided by the centroids of training aligned instances from each class, and after that the output values of the encoders are extracted as the ultimate generalized representations. Finally, a support vector machine is built on the generalized features extracted from both the source and target datasets and then used to classify the new features extracted from unseen samples of the target dataset.

3.1. Preprocessing
3.1.1. Feature Extraction and Encoding

In this paper, we choose improved dense trajectories (iDTs) [48] features with trajectory shape, HOF, MBHx, and MBHy descriptors as the low-level action video representation. The total length of the feature vector is 330. Specifically, we use the implementation released on the website of Wang (https://lear.inrialpes.fr/people/wang/improved_trajectories/) for iDTs and keep the default parameter setting. For iDTs, the large number of local trajectory descriptors may lead to high computational complexity and memory consumption. To cope with this issue, we adopt the Locality-constrained Linear Coding (LLC) [49] scheme to represent the iDTs by multiple bases, which reduces quantization error while preserving local smooth sparsity. Taking both efficiency and reconstruction error into consideration, the LLC encoding scheme is applied to the iDTs with 5 local bases, and the codebook size is set to 4000 for all training-testing partitions. Thus, the dimension of the encoded iDTs features is 4000. To reduce the complexity of constructing the codebook, only 200 local iDTs are randomly selected from each video sequence.
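As an illustration of this encoding step, the following is a minimal sketch of approximated LLC coding with 5 local bases in Python/NumPy. It is a simplified sketch rather than the implementation used in the paper: the codebook is assumed to be given (e.g., learnt by k-means over sampled iDTs), the max-pooling over descriptors is an assumption, and all function and variable names are illustrative.

    import numpy as np

    def llc_encode(x, codebook, knn=5, beta=1e-4):
        # Approximated LLC coding of one local descriptor x (Wang et al., CVPR 2010).
        # codebook: (M, d) array of M visual words; returns an M-dimensional sparse code.
        dists = np.linalg.norm(codebook - x, axis=1)
        idx = np.argsort(dists)[:knn]              # the knn nearest codebook bases
        B = codebook[idx]                          # (knn, d) local bases
        z = B - x                                  # shift bases to the descriptor
        C = z @ z.T                                # local covariance (knn, knn)
        C += beta * np.trace(C) * np.eye(knn)      # regularization for numerical stability
        w = np.linalg.solve(C, np.ones(knn))       # constrained least-squares solution
        w /= w.sum()                               # coefficients sum to one
        code = np.zeros(codebook.shape[0])
        code[idx] = w                              # scatter coefficients into the full code
        return code

    def encode_video(descriptors, codebook, knn=5):
        # Pool the LLC codes of all sampled iDT descriptors of one video
        # (max-pooling here is an assumption; the paper does not state the pooling scheme).
        codes = np.stack([llc_encode(d, codebook, knn) for d in descriptors])
        return codes.max(axis=0)                   # e.g., 4000-dim video representation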

3.1.2. Principal Component Analysis

After LLC encoding, the feature representations are still high dimensional and strongly correlated. To obtain a more compact feature representation, we utilize principal component analysis (PCA) [50] to preprocess these features. In our method, we retain the top principal components such that their cumulative eigenvalues cover over 99% of the total eigenvalue sum. In our experiments, this reduces the feature dimension to the range of 500 to 600, varying between datasets.
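A minimal sketch of this preprocessing step is shown below, using scikit-learn's PCA as a stand-in for the implementation used in the paper; passing a float threshold keeps the components that explain the requested fraction of the total variance.

    from sklearn.decomposition import PCA

    pca = PCA(n_components=0.99)            # keep components covering 99% of the variance
    X_train_low = pca.fit_transform(X_train)  # typically ~500-600 dimensions in our setting
    X_test_low = pca.transform(X_test)         # apply the same projection to unseen data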

3.2. Feature Alignment by Kernel Manifold Alignment

In this section, we detail the feature alignment method based on Kernel Manifold Alignment (KEMA). An illustration of how feature alignment functions is shown in Figure 2.

3.2.1. Notation

To fix notation, we consider $K$ input domains, whose data instances belong to $c$ different classes. Let $X_m = \{\mathbf{x}_1^m, \dots, \mathbf{x}_{n_m}^m\}$ represent the $m$-th input domain, where $n_m$ is the number of samples in the $m$-th domain. The idea of kernelization is to map an input data instance $\mathbf{x}$ into a high-dimensional Hilbert space $\mathcal{H}_m$ with a mapping function $\phi_m(\cdot)$ such that the mapped data are better suited to solving our problem linearly. The kernel trick is adopted in our method to avoid a high computational load. Therefore, we define a kernel function $k_m(\mathbf{x}_i, \mathbf{x}_j) = \langle \phi_m(\mathbf{x}_i), \phi_m(\mathbf{x}_j) \rangle$ that computes the similarity between mapped instances without having to compute $\phi_m(\cdot)$ explicitly. Many common types of kernel functions can be adopted in KEMA, such as the RBF kernel, the linear kernel, and the polynomial kernel; in this paper, we use the RBF kernel. Considering multiple data modalities, we map the $K$ datasets to Hilbert spaces $\mathcal{H}_1, \dots, \mathcal{H}_K$ of dimensions $h_1, \dots, h_K$.

3.2.2. Kernel Manifold Alignment (KEMA)

The KEMA method aims to construct domain-specific projection functions $\mathbf{f}_1, \dots, \mathbf{f}_K$ to project the data in Hilbert space from all domains to a new common latent space, in which the topology of each domain is preserved, instances from the same class are located close to each other, and instances from different classes are far from each other. To do so, KEMA finds a projection matrix $\mathbf{F}$ that minimizes the following cost function:
$$\mathcal{L}(\mathbf{F}) = \frac{\mu\,\mathrm{TOP} + (1-\mu)\,\mathrm{SIM}}{\mathrm{DIS}},\tag{1}$$
where TOP, SIM, and DIS denote the topology, class similarity, and class dissimilarity terms, respectively, and $\mu \in [0, 1]$ is a parameter balancing the contribution of the similarity and topology terms: when $\mu$ approaches 1, more importance is given to topology, and vice versa. The three terms are defined as follows:
(1) Minimizing a topology-preservation term, TOP, which aims to preserve the local topology of each data domain:
$$\mathrm{TOP} = \sum_{m=1}^{K}\sum_{i,j=1}^{n_m} W_m(i,j)\,\bigl\|\mathbf{f}_m^{\top}\phi_m(\mathbf{x}_i^m) - \mathbf{f}_m^{\top}\phi_m(\mathbf{x}_j^m)\bigr\|^2,\tag{2}$$
where $W_m(i,j)$ in the similarity matrix $W_m$ represents the similarity of $\mathbf{x}_i^m$ and $\mathbf{x}_j^m$ within the $m$-th domain (computed from a $k$-NN graph in our experiments). $L_m = D_m - W_m$ is the graph Laplacian matrix issued from $W_m$, where $D_m$ is the diagonal row-sum matrix defined as $D_m(i,i) = \sum_j W_m(i,j)$.
(2) Minimizing a class similarity term, SIM, which encourages instances with the same class label to be located close to each other in the new latent space:
$$\mathrm{SIM} = \sum_{m,m'=1}^{K}\sum_{i,j} W_s(i,j)\,\bigl\|\mathbf{f}_m^{\top}\phi_m(\mathbf{x}_i^m) - \mathbf{f}_{m'}^{\top}\phi_{m'}(\mathbf{x}_j^{m'})\bigr\|^2,\tag{3}$$
where $W_s(i,j)$ in the similarity matrix $W_s$ is set to 1 if two instances from domains $m$ and $m'$ share the same class label and 0 otherwise (including the case where label information is not available). The corresponding diagonal row-sum matrix is $D_s(i,i) = \sum_j W_s(i,j)$ and the graph Laplacian matrix is $L_s = D_s - W_s$.
(3) Maximizing a class dissimilarity term, DIS, which encourages instances with different class labels to be separated in the new latent space:
$$\mathrm{DIS} = \sum_{m,m'=1}^{K}\sum_{i,j} W_d(i,j)\,\bigl\|\mathbf{f}_m^{\top}\phi_m(\mathbf{x}_i^m) - \mathbf{f}_{m'}^{\top}\phi_{m'}(\mathbf{x}_j^{m'})\bigr\|^2,\tag{4}$$
where $W_d(i,j)$ in the dissimilarity matrix $W_d$ is set to 1 if two instances from domains $m$ and $m'$ are from different classes and 0 otherwise (including the case where label information is not available). The corresponding diagonal row-sum matrix is $D_d(i,i) = \sum_j W_d(i,j)$ and the graph Laplacian matrix is $L_d = D_d - W_d$.

Given (2)–(4), the optimization problem is formalized as follows:
$$\min_{\mathbf{F}}\;\frac{\operatorname{tr}\bigl\{\mathbf{F}^{\top}\boldsymbol{\Phi}\,\bigl(\mu L + (1-\mu)L_s\bigr)\,\boldsymbol{\Phi}^{\top}\mathbf{F}\bigr\}}{\operatorname{tr}\bigl\{\mathbf{F}^{\top}\boldsymbol{\Phi}\,L_d\,\boldsymbol{\Phi}^{\top}\mathbf{F}\bigr\}},\tag{5}$$
where $L$ is the block-diagonal matrix containing the per-domain graph Laplacians $L_1, \dots, L_K$.

It is straightforward to show that the solution of (5) boils down to finding the smallest eigenvalues of the following generalized eigenvalue decomposition [51]:
$$\boldsymbol{\Phi}\,\bigl(\mu L + (1-\mu)L_s\bigr)\,\boldsymbol{\Phi}^{\top}\mathbf{v} = \lambda\,\boldsymbol{\Phi}\,L_d\,\boldsymbol{\Phi}^{\top}\mathbf{v},\tag{6}$$
where $\boldsymbol{\Phi} = \operatorname{diag}(\boldsymbol{\Phi}_1, \dots, \boldsymbol{\Phi}_K)$ is a matrix containing the mapped data matrices in block-diagonal form, $\mathbf{v}$ contains the eigenvectors $\mathbf{f}_m$ for the particular domains defined in the Hilbert spaces $\mathcal{H}_m$, and $\lambda$ is an eigenvalue of the generalized eigenvalue decomposition problem. Note that the $\boldsymbol{\Phi}_m$ are high dimensional and cannot be explicitly computed. Therefore, the eigenvectors are expressed as linear combinations of the mapped instances using the Riesz representation theorem [52], that is, $\mathbf{f}_m = \boldsymbol{\Phi}_m\boldsymbol{\alpha}_m$, or $\mathbf{v} = \boldsymbol{\Phi}\boldsymbol{\Lambda}$ in matrix notation. In (6), by multiplying both sides by $\boldsymbol{\Phi}^{\top}$ and replacing the dot products with the corresponding kernel matrices, $\mathbf{K}_m = \boldsymbol{\Phi}_m^{\top}\boldsymbol{\Phi}_m$, the final solution is obtained:
$$\mathbf{K}\,\bigl(\mu L + (1-\mu)L_s\bigr)\,\mathbf{K}\,\boldsymbol{\Lambda} = \lambda\,\mathbf{K}\,L_d\,\mathbf{K}\,\boldsymbol{\Lambda},\tag{7}$$
where $\mathbf{K} = \operatorname{diag}(\mathbf{K}_1, \dots, \mathbf{K}_K)$ is a matrix containing the kernel matrices in block-diagonal form. The block structure of the projection matrix $\boldsymbol{\Lambda}$ is as follows:
$$\boldsymbol{\Lambda} = \begin{bmatrix}\boldsymbol{\alpha}_1 \\ \alpha_2 \\ \vdots \\ \alpha_K\end{bmatrix},\tag{8}$$
where the coefficients $\boldsymbol{\alpha}_1$ for the first domain are highlighted in bold.

Once the projection matrix $\boldsymbol{\Lambda}$ is obtained, an instance $\mathbf{x}_*^m$ from the $m$-th domain can be projected to the new latent space by first mapping it to its corresponding kernel form and then applying the corresponding projection vectors defined therein:
$$P(\mathbf{x}_*^m) = \boldsymbol{\alpha}_m^{\top}\,\mathbf{k}_*^m,\tag{9}$$
where $\mathbf{k}_*^m = \bigl[k_m(\mathbf{x}_*^m, \mathbf{x}_1^m), \dots, k_m(\mathbf{x}_*^m, \mathbf{x}_{n_m}^m)\bigr]^{\top}$ is a vector of kernel evaluations between the instance $\mathbf{x}_*^m$ and all instances from domain $m$ used to define the projections $\boldsymbol{\alpha}_m$. As in other eigenvalue-decomposition based methods, the data can be projected onto a lower-dimensional subspace by simply preserving the first $d$ columns of $\boldsymbol{\Lambda}$, where $d$ is bounded by the total number of samples involved in the kernel matrices. In this sense, KEMA leaves some control over the dimensionality of the latent space used for feature alignment.

In this paper, the number of input domains is set to $K = 2$, because there is only one target domain (the infrared dataset) and one source domain (the visible light dataset) in our experiments.
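To make the alignment step concrete, the following is a minimal numerical sketch of KEMA for our two-domain case in Python/NumPy. It is not the authors' implementation: the $k$-NN graph construction, the $(\mu, 1-\mu)$ weighting of the topology and similarity Laplacians, the small regularizer added to the dissimilarity side of the eigenproblem, and all function and variable names are illustrative assumptions consistent with Sections 3.2.2 and 4.2.

    import numpy as np
    from scipy.spatial.distance import cdist
    from scipy.linalg import eigh

    def rbf(X, Y, sigma):
        return np.exp(-cdist(X, Y, 'sqeuclidean') / (2 * sigma ** 2))

    def laplacian(W):
        return np.diag(W.sum(axis=1)) - W

    def kema_two_domains(Xs, ys, Xt, yt, mu=0.1, dim=100, k=5):
        # Sketch of KEMA for K = 2 domains. Unlabeled samples carry label -1 and
        # contribute only to the topology (graph Laplacian) term.
        ns, nt = len(Xs), len(Xt)
        y = np.concatenate([ys, yt])
        # per-domain RBF kernels; bandwidth = half the median pairwise distance (Section 4.2)
        sig_s = 0.5 * np.median(cdist(Xs, Xs))
        sig_t = 0.5 * np.median(cdist(Xt, Xt))
        K = np.zeros((ns + nt, ns + nt))
        K[:ns, :ns] = rbf(Xs, Xs, sig_s)
        K[ns:, ns:] = rbf(Xt, Xt, sig_t)
        # topology: symmetric k-NN graph inside each domain (block-diagonal)
        def knn_graph(X):
            D = cdist(X, X); W = np.zeros_like(D)
            for i in range(len(X)):
                W[i, np.argsort(D[i])[1:k + 1]] = 1
            return np.maximum(W, W.T)
        W_topo = np.zeros_like(K)
        W_topo[:ns, :ns] = knn_graph(Xs)
        W_topo[ns:, ns:] = knn_graph(Xt)
        # class similarity / dissimilarity graphs over all labeled samples
        labeled = (y[:, None] >= 0) & (y[None, :] >= 0)
        W_sim = (labeled & (y[:, None] == y[None, :])).astype(float)
        W_dis = (labeled & (y[:, None] != y[None, :])).astype(float)
        A = K @ (mu * laplacian(W_topo) + (1 - mu) * laplacian(W_sim)) @ K
        B = K @ laplacian(W_dis) @ K + 1e-6 * np.eye(ns + nt)   # regularized for stability
        vals, vecs = eigh(A, B)          # generalized eigenproblem, ascending eigenvalues
        alpha = vecs[:, :dim]            # smallest eigenvalues span the latent space
        Z = K @ alpha                    # aligned features for all training samples
        return Z[:ns], Z[ns:], alpha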

3.3. Feature Generalization by Aligned-to-Generalized Encoders

Due to the huge modality gap between infrared and visible light data, a unified subspace may not exist when only using KEMA to align features from both domains. To be more specific, KEMA holds the assumption that a unified aligned space for both the source and the target domain exists. This assumption is rather strict and may be invalid in some cases. Therefore, we relax this strict assumption and learn transferable feature representations across the infrared and visible domains in a hierarchical way. With the obtained aligned representations, an aligned-to-generalized encoders (AGE) model is adopted to force the outputs of input aligned instances from the same action class to be identical. The AGE is trained under the guidance of an identical representation for each action class, so that intraclass diversities are minimized and generalized representations are generated across datasets. In this section, we present the architecture and details of the proposed AGE.

3.3.1. Target Output Generation

The centroid of each action class is used as the target output, computed by averaging over the aligned feature representations of the instances in that class. Let $\mathbf{z}_i^{s,c}$ and $\mathbf{z}_j^{t,c}$ denote the aligned representations of the $i$-th and $j$-th training instances from class $c$ in the source and target datasets, respectively; the target output for instances from class $c$ is defined as follows:
$$\mathbf{t}^{c} = \frac{1}{N_s^{c} + N_t^{c}}\left(\sum_{i=1}^{N_s^{c}} \mathbf{z}_i^{s,c} + \sum_{j=1}^{N_t^{c}} \mathbf{z}_j^{t,c}\right),\tag{10}$$
where $N_s^{c}$ and $N_t^{c}$ denote the total numbers of training instances of class $c$ from the source and target datasets.
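A minimal sketch of this target-output computation is given below (Python/NumPy), assuming `Z_src`/`Z_tgt` hold the KEMA-aligned training features and `y_src`/`y_tgt` their class labels; names are illustrative.

    import numpy as np

    def class_centroids(Z_src, y_src, Z_tgt, y_tgt):
        # Target output for each action class: the mean of all aligned source and
        # target training instances of that class (equation (10)).
        targets = {}
        for c in np.unique(np.concatenate([y_src, y_tgt])):
            Zc = np.vstack([Z_src[y_src == c], Z_tgt[y_tgt == c]])
            targets[c] = Zc.mean(axis=0)
        return targets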

3.3.2. Encoders Training

At the feature generalization stage, a pair of aligned-to-generalized encoders are trained on the source and target datasets in parallel. For instances with the same action class label, the target outputs of the two aligned-to-generalized encoders are identical, which forces the two encoders to generalize over varying inputs and guides the outputs of same-class instances to be similar. In this sense, aligned instances with the same class label from the two datasets are mapped to the same feature space [36, 53] with low intraclass diversity [19], and generalized and discriminative representations of the instances are then generated across datasets.

In this section, we demonstrate the benefits of using a pair of aligned-to-generalized encoders for feature generalization in visible-to-infrared action recognition. The architecture of the aligned-to-generalized encoders (AGE) is illustrated in Figure 3. The AGE are essentially fully connected feedforward neural networks with an input layer, a hidden layer, and an output layer. Although a deeper network architecture with more than one hidden layer could intuitively learn more robust and discriminative representations, it has been shown that carefully configured and trained single-hidden-layer neural networks can also achieve good performance in many tasks [19, 54], which is validated in our experiments as well. In addition, the architectures of the AGE trained on the source and target datasets are identical.

After feature extraction and encoding, the training videos from both datasets are represented by iDTs features encoded by LLC, which are then reduced to a low-dimensional subspace via PCA. Then, the KEMA method is applied to map the raw feature representations of instances from the two datasets to the common latent space to obtain aligned features, as shown in Figure 3. For simplicity, we denote one training aligned instance as $\mathbf{z}$ with dimension $d$. Both aligned-to-generalized encoders therefore have an input layer of size $d$, a hidden layer of size $H$, and an output layer of size $L$, where $L$ is the length of the ultimate generalized output feature vector (equal to $d$ in our setting, since the target outputs are centroids of aligned features) and $H$ is a user-defined parameter. In this work, we empirically restrict $H$ to be equal to the input layer size and experiment with hidden layer sizes that range from 50 to 500 in Section 4.4.4.

Although both aligned-to-generalized encoders have the same architecture, they have different parameters. The goal of training the two encoders in parallel is to find a mapping between the training aligned instances and the target generalized outputs. Taking the source encoder as an example, the mapping is accomplished via two functions $g(\cdot)$ and $h(\cdot)$, defined as follows:
$$g(\mathbf{z}) = f\bigl(\mathbf{W}_1\mathbf{z} + \mathbf{b}_1\bigr), \qquad h\bigl(g(\mathbf{z})\bigr) = f\bigl(\mathbf{W}_2\,g(\mathbf{z}) + \mathbf{b}_2\bigr),\tag{11}$$
where $f(\cdot)$ is the activation function and $(\mathbf{W}_1, \mathbf{b}_1)$ and $(\mathbf{W}_2, \mathbf{b}_2)$ are the parameters for $g(\cdot)$ and $h(\cdot)$, respectively. The encoder parameters are indicated in Figure 3. Given a hidden layer size $H$, the weights and biases in both encoders are initialized with random numbers drawn from a uniform distribution between 0 and 1. The sigmoid function is used as the activation function:
$$f(a) = \frac{1}{1 + e^{-a}}.\tag{12}$$

The objective function of each encoder is defined as follows:
$$J(\mathbf{W}_1,\mathbf{b}_1,\mathbf{W}_2,\mathbf{b}_2) = \frac{1}{2N}\sum_{i=1}^{N}\bigl\|\,h\bigl(g(\mathbf{z}_i)\bigr) - \mathbf{t}_i\,\bigr\|^2,\tag{13}$$
where $g(\cdot)$ and $h(\cdot)$ are defined in (11), $\mathbf{t}_i$ is the target output (10) of the class of instance $\mathbf{z}_i$, $\mathbf{W}_1$, $\mathbf{b}_1$, $\mathbf{W}_2$, and $\mathbf{b}_2$ are the weights and biases of the encoder, and $N$ is the number of training instances.

Stochastic gradient descent is utilized to minimize the objective function by iteratively updating the weights and biases. For example, $\mathbf{W}_1$ is updated as follows:
$$\Delta\mathbf{W}_1^{(k+1)} = m\,\Delta\mathbf{W}_1^{(k)} - \eta\,\frac{\partial J^{(k)}}{\partial \mathbf{W}_1}, \qquad \mathbf{W}_1^{(k+1)} = \mathbf{W}_1^{(k)} + \Delta\mathbf{W}_1^{(k+1)},\tag{14}$$
where $\Delta\mathbf{W}_1^{(k)}$ is the update to $\mathbf{W}_1$ at the $k$-th iteration, $J^{(k)}$ is the objective function value at the $k$-th iteration, $\eta$ denotes the learning rate, and $m$ denotes the momentum. $\mathbf{b}_1$, $\mathbf{W}_2$, and $\mathbf{b}_2$ are updated in a similar way.

In our method, the activations of the output layer are extracted as the ultimate generalized features. When the objective function is minimized, the output values of the encoder are approximate solutions staying close to the target outputs rather than being identical to the predefined target outputs. Therefore, the final features extracted from aligned instances of the same action class, from both the source and target datasets, lie in the same cluster with small intraclass diversity and high interclass variance. This phenomenon is illustrated in Section 4.4.2.
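The sketch below illustrates the AGE training loop described above (Python/NumPy). For brevity it uses full-batch gradient descent with momentum rather than per-sample stochastic updates; the hidden size, random seed, and helper names are illustrative assumptions, and `targets` is the per-class centroid dictionary from the sketch in Section 3.3.1.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def train_age(Z, y, targets, hidden=100, lr=0.1, momentum=0.9, iters=1000):
        # Train one aligned-to-generalized encoder: aligned features Z (n, d) are
        # regressed onto the per-class centroid targets with one hidden layer.
        n, d = Z.shape
        L = next(iter(targets.values())).shape[0]       # output size = aligned-feature dim
        T = np.stack([targets[c] for c in y])            # (n, L) desired outputs
        rng = np.random.default_rng(0)
        W1, b1 = rng.random((d, hidden)), rng.random(hidden)   # uniform [0, 1) init, as in (12)
        W2, b2 = rng.random((hidden, L)), rng.random(L)
        vW1, vb1 = np.zeros_like(W1), np.zeros_like(b1)
        vW2, vb2 = np.zeros_like(W2), np.zeros_like(b2)
        for _ in range(iters):
            H = sigmoid(Z @ W1 + b1)                    # hidden activations, eq. (11)
            O = sigmoid(H @ W2 + b2)                    # output (generalized) features
            dO = (O - T) * O * (1 - O) / n              # gradient of the MSE objective (13)
            dH = (dO @ W2.T) * H * (1 - H)
            grads = [Z.T @ dH, dH.sum(0), H.T @ dO, dO.sum(0)]
            for p, v, g in zip([W1, b1, W2, b2], [vW1, vb1, vW2, vb2], grads):
                v *= momentum; v -= lr * g; p += v      # momentum update, eq. (14)
        return W1, b1, W2, b2

    def age_features(Z, params):
        # Output-layer activations are taken as the final generalized features.
        W1, b1, W2, b2 = params
        return sigmoid(sigmoid(Z @ W1 + b1) @ W2 + b2)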

To make it clear, the proposed CDFAG is summarized in Algorithm 1.

Input:
Raw features $X_1, \dots, X_K$ from the source and target datasets, the number of input data domains $K = 2$, dimension of the common latent subspace $d$, trade-off parameter $\mu$, maximum iterations 1000, input layer size $L$, output layer size $L$, hidden layer size $H$, learning rate 0.1, momentum 0.9; the encoder weights and biases are randomly initialized.
Feature alignment:
(1) Map the raw features from the $K$ datasets to their Hilbert spaces via the kernel matrices $\mathbf{K}_m$.
(2) Construct the graph Laplacian matrices $L$, $L_s$, and $L_d$ defined in Section 3.2.2.
(3) Compute the mapping functions $\boldsymbol{\alpha}_m$ by finding the smallest eigenvalues of the generalized eigenvalue problem (7).
(4) Apply $\boldsymbol{\alpha}_m$ to map the input datasets to the new $d$-dimensional common latent space to obtain aligned features, as in (9).
Feature generalization:
(5) Calculate the target outputs of the aligned-to-generalized encoders for each class $c$ with (10), where $\mathbf{z}_i^{s,c}$ and $\mathbf{z}_j^{t,c}$ denote the aligned features of the $i$-th and $j$-th training instances from class $c$ in the source and target datasets.
(6) for iter = 1 to 1000 do
(7) Minimize the objective function (13) for both encoders in parallel via stochastic gradient descent (14).
(8) end for
(9) Take the output-layer activations of the aligned-to-generalized encoders as the final generalized features.
Output:
Generalized features across the different datasets.
3.4. Classification

Due to the lack of infrared data, directly using a neural network for classification may lead to overfitting. Therefore, we use the AGE only as a feature extractor rather than a classifier to avoid overfitting. In our experiments, we use a multiclass support vector machine (SVM) as the classifier rather than a softmax layer in the AGE because the SVM classifier obtains better results, as validated in [55, 56]. To perform visible-to-infrared action classification, an SVM classifier with RBF kernel is trained on generalized features extracted from both the visible light (source) and infrared (target) datasets and tested on generalized features extracted from unseen instances of the infrared (target) dataset, as shown at the bottom of Figure 1. In Section 4, we will show that such a classification scheme can effectively classify unseen action data in the target dataset. This can be attributed to the successful knowledge transfer from the source domain to the target domain by our proposed CDFAG. To make it clear, the testing procedure of our proposed CDFAG is summarized in Algorithm 2.

Input:
Raw features of the testing samples in the target dataset, the number of testing samples in the target dataset, the dimension of the raw features, the dimension of the common latent subspace $d$, trade-off parameter $\mu$, the learned projection function $\boldsymbol{\alpha}_t$ for the target dataset, the trained aligned-to-generalized encoder for the target dataset with parameters $(\mathbf{W}_1, \mathbf{b}_1, \mathbf{W}_2, \mathbf{b}_2)$, and the SVM classifier trained on samples from both the source and target datasets.
Feature alignment:
(1) Map the raw features of the target testing samples to the Hilbert space via the kernel evaluations $\mathbf{k}_*^t$.
(2) Apply $\boldsymbol{\alpha}_t$ to map the raw features of the target testing samples to the $d$-dimensional common latent space to obtain aligned features, as in (9).
Feature generalization:
(3) Input the aligned features into the trained aligned-to-generalized encoder and obtain the generalized features at the output layer.
Classification:
(4) Adopt the trained SVM classifier to predict the class labels of the testing samples in the target dataset using the generalized features.
Output:
Predicted labels of the testing samples in the target dataset.
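A minimal sketch of this classification stage is shown below (Python/scikit-learn, whose SVC wraps LibSVM). Here `G_src`, `G_tgt`, `y_src`, `y_tgt`, and `G_test_tgt` are assumed to be the generalized training/testing features and labels produced by the earlier stages, and the grid of C and gamma values is illustrative rather than the one used in the paper.

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import GridSearchCV

    # generalized training features from BOTH domains (source + target) and their labels
    G_train = np.vstack([G_src, G_tgt])
    y_train = np.concatenate([y_src, y_tgt])

    # RBF-kernel SVM; C and gamma chosen by 5-fold cross-validation as in Section 4.2
    grid = {'C': 10.0 ** np.arange(-2, 4), 'gamma': 10.0 ** np.arange(-4, 2)}
    clf = GridSearchCV(SVC(kernel='rbf'), grid, cv=5).fit(G_train, y_train)

    # at test time: align (KEMA projection), generalize (target AGE), then predict
    y_pred = clf.predict(G_test_tgt)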

4. Experimental Results

In this section, we present our experimental results on the benchmark dataset. We will start with describing the individual datasets, followed by details of our experimental settings.

4.1. Datasets

The InfAR dataset (https://sites.google.com/site/gaochenqiang/publication/infrared-action-dataset/) and the XD145 dataset (the dataset will be available at https://sites.google.com/site/yangliuxdu/) are used for the visible-to-infrared action recognition task, where the XD145 dataset is used as the source domain and the InfAR dataset as the target domain.
(A) InfAR. The InfAR dataset [20] consists of 600 video sequences captured by infrared thermal imaging cameras. As shown in Figure 4, fight, handclapping, handshake, hug, jog, jump, punch, push, skip, walk, wave 1 (one-hand wave), and wave 2 (two-hand wave) are included in the dataset, where each action class has 50 video clips and each clip lasts 4 s on average. The frame rate is 25 fps and the resolution is 293 × 256. Each video contains one or multiple actions performed by one or several persons; some of them are interactions between multiple persons, as shown in Figure 4.
(B) XD145. We build a visible light action dataset, named XD145, following the common approach to constructing an action recognition dataset from the visible spectrum [57]. In correspondence with the target domain action categories, the XD145 and InfAR datasets share the same action categories, as shown in Figure 5. The XD145 action dataset consists of 600 video sequences captured by visible light cameras, with 50 video clips for each action class. All actions were performed by 30 different volunteers, and each clip lasts 5 s on average. The frame rate is 25 fps and the resolution is 320 × 240. As shown in Figure 5, background, pose, and viewpoint variations are considered when constructing the dataset in order to make it more representative of real-world scenarios.

Figure 6 illustrates sample actions from the InfAR and XD145 datasets. As can be seen in the figure, these action videos are captured in two different light spectra and exhibit significant intraclass variance and a large modality gap.

4.2. Experimental Settings

In all experiments, each dataset is randomly split into training and testing sets. For evaluation, the average precision (AP) is used, which is the average of the recognition precisions of all actions. For each evaluation, we repeat the experiments with the same setting 5 times and report the average accuracy. In KEMA, we use RBF kernels with the bandwidth fixed as half of the median distance between the samples of the specific domain (labeled and unlabeled). By doing so, we allow different kernels in each domain, thus tailoring the similarity function to the observed data structure [15]. The trade-off parameter $\mu$ is set according to the experimental analysis in Section 4.4.3. To build the graph Laplacians, we use a $k$-NN graph. We validate the optimal dimension of the common latent space as well as the optimal $C$ and $\gamma$ parameters of the SVM classifier. Since the RBF kernel is adopted in KEMA, LibSVM [58] is used to train classifiers with the RBF kernel in our experiments as well; the optimal parameters $C$ and $\gamma$ are determined by 5-fold cross-validation. When performing stochastic gradient descent for encoder training, we set the learning rate to 0.1 and the momentum to 0.9 (see (14)). Each encoder is trained for about 1000 iterations. All experiments are conducted with MATLAB R2016b on a 64-bit Windows 10 PC with a 4-core 3.60 GHz Intel i7 CPU and 16 GB of memory.

4.3. Action Recognition Results with Raw Features

We evaluate the original features on our newly constructed visible light action dataset XD145 and on the infrared action recognition dataset InfAR, respectively, which serves as the baseline in this paper. For each evaluation, we repeat the experiments with the same settings 5 times, where each time 25/20/15/10/5 samples out of the 50 samples for each action category are randomly selected as the training set and samples from the remaining ones are used as the testing set. The averages are then reported as the final results, as shown in Table 1.

From Table 1, we can observe that the recognition accuracies on both datasets grow as the number of training samples increases. This is basically in accordance with the traditional action recognition intuition that good results can be achieved with an adequate amount of labeled training samples and discriminative features. Although the same feature is adopted for both datasets, the accuracies on InfAR are much lower than those on XD145 even under the same experimental setting. This may be because the videos in these two datasets are captured by different sensors and exhibit different appearance and motion information for the same action class, as shown in Figure 6. Since the iDTs feature is good at describing appearance and motion in visible light action videos [48], its effectiveness and strength may not be fully revealed in infrared videos. Therefore, utilizing existing visible light action data as an aid for enhancing infrared action recognition systems is urgently needed.

4.4. Visible-to-Infrared Action Recognition Results

In this section, we evaluate our proposed method on visible-to-infrared action recognition. Our experiments are mainly divided into four parts. Firstly, a classifier trained on instances from both source and target datasets without feature alignment and generalization is utilized to predict actions in target dataset. Secondly, the classification results with aligned features obtained by KEMA are also provided. Then, we evaluate our proposed CDFAG in visible-to-infrared action recognition. Lastly, we compare our proposed CDFAG with several state-of-the-art methods.

4.4.1. Visible-to-Infrared Action Recognition

To find out whether a modality gap exists between the source and target datasets, we first train a classifier using samples from both datasets without feature alignment and generalization to predict actions in the target dataset. We call this method No Adaptation (NA). In addition, the classification results with aligned features obtained by KEMA are provided to show the effectiveness of feature alignment. For each evaluation, we repeat the experiments with the same parameter settings 5 times, where each time 45/40/30/25/20/15/10/5 samples for each action category in the source dataset, combined with 25/20/15/10/5 samples for each action category in the target dataset, are randomly selected as the training set, in order to validate the impact of the number of training samples on recognition accuracy. Then, 20 samples for each action category in the target dataset are used as the testing set. The averages are reported as the final results, as shown in Table 2.

We can see in Table 2 that the best accuracy of NA in each column (marked in bold) occurs when the number of training samples in the source dataset (S_train) is relatively small. Comparing the infrared action recognition results in Table 1 with the NA results in Table 2, we can see that the No Adaptation (NA) method outperforms the baseline only when fewer samples are used for training (T_train = 10 and T_train = 5), which demonstrates that directly transferring knowledge from the source domain to the target domain without considering their divergence can cause significant performance degradation, especially when the number of source domain samples is large. When the number of source and target training samples is small, the performance gets slightly better because of the complementary information between the source and target training samples. However, as the number of source and target training samples increases, the modality gap between datasets begins to dominate the recognition accuracy, and the performance degradation becomes more serious. Therefore, it is necessary to reduce the modality gap before directly using the two datasets together.

Then, we evaluate our proposed CDFAG on visible-to-infrared action recognition. At the feature alignment stage, we use the labeled and unlabeled samples to extract the KEMA projections and then project all videos into the latent space to obtain aligned samples. We set the dimension of the features in the common latent space to 100. We experiment with various feature dimensions and report the results in Section 4.4.4. For videos from the same action class in the source dataset XD145, we randomly choose 45/40/35/30/25/20/15/10/5 videos as the training set and 5 videos as the unlabeled set. For videos from the same action class in the target dataset InfAR, 20 samples out of 50 are randomly selected as the testing set; we then randomly choose 25/20/15/10/5 videos as the training set and use the remaining videos as the unlabeled set, since the target dataset usually has more unlabeled but fewer labeled samples than the source dataset in real-world scenarios. The unlabeled samples are utilized to compute the graph Laplacians. At the feature generalization stage, all the aligned samples from both the source and target training sets are used to guide the training of the two aligned-to-generalized encoders. After feature generalization, an SVM classifier with RBF kernel is trained on all the aligned and generalized labeled training samples from both datasets. The trained SVM is used to predict all the testing videos in the target dataset. All evaluation results are listed in Table 2.

Comparing the NA and KEMA results in Table 2, it is obvious that KEMA achieves higher accuracies than NA under all parameter settings, which validates the effectiveness of KEMA in aligning the features across the source and target domains. In addition, the infrared action recognition accuracies of our proposed CDFAG are significantly higher than those of NA and KEMA under all parameter settings, which validates the effectiveness of our proposed CDFAG in reducing the modality gap between the source and target datasets by using both feature alignment and generalization. Furthermore, we achieve at least a 5% increase in infrared action recognition accuracy over the baseline method in Table 1 (the third column), as a result of utilizing source domain data as an aid for enhancing the recognition system with the proposed feature alignment and generalization method. For a more intuitive comparison, the best accuracies under different numbers of target training samples for each method are plotted in Figure 7. As illustrated in Figure 7, our proposed method achieves remarkable performance improvements in infrared action recognition under all parameter settings, especially with fewer training samples, which verifies its effectiveness in utilizing auxiliary source domain data when target training data are scarce.

To further explore the infrared action classification performance, two confusion matrices are illustrated in Figure 8. It can be seen that our proposed method achieves higher classification accuracies for nearly all action categories compared with the baseline method. However, there is limited accuracy improvement for two actions, push and punch, since these two actions are easily confused with each other. From Figures 4(g) and 4(h), we can see that punch and push are so similar that it may even be difficult for a human to distinguish them.

4.4.2. Visualization of Aligned and Generalized Features

To verify that, by using our proposed CDFAG, the action data from different datasets are indeed projected into a unique common latent space, we plot the distributions of the raw features, aligned features, and generalized features of instances from all action classes in the source and target datasets. For illustration purposes, we compare the first 3 dimensions of the raw iDTs features, the aligned features, and the learnt generalized features, as shown in Figure 9. As can be seen in Figure 9(a), the original features of instances from the two datasets clearly form separate clusters with large intraclass diversity and small interclass variance; after alignment, the instances from the same class are projected to similar locations and the instances from different classes are well separated, as illustrated in Figure 9(b). Although there is only a small difference between Figures 9(b) and 9(c), except that the instances from the same class in Figure 9(c) merge into a more compact cluster, feature generalization indeed maps instances from different datasets to a more compact feature space with relatively small intraclass diversity and large interclass variance, which makes the features more generalized and discriminative for visible-to-infrared action recognition.

4.4.3. Analysis of Trade-Off Parameter

To evaluate the optimal value range of the trade-off parameter $\mu$ in (1), we evaluate the performance of our proposed CDFAG with different values of $\mu$. Specifically, we vary $\mu$ from 0 to 1 on the InfAR dataset when S_train = 45 and T_train = 25, S_train = 40 and T_train = 25, and S_train = 35 and T_train = 25. The results of other S_train and T_train settings are not evaluated because they show similar trends to these three settings. For each evaluation, we repeat the experiments with the same settings 5 times and report the average accuracies.

The experimental results are given in Table 3. As shown in Table 3, the average accuracies of these three settings are better when the trade-off parameter $\mu$ is small, which indicates that good performance is achieved when more importance is attached to the class similarity term than to topology preservation in the KEMA procedure. When we treat both terms equally ($\mu = 0.5$), the performance is also unsatisfactory. This is attributed to the fact that the modality difference between the infrared and visible light domains is so large that the topology also varies considerably between them; therefore, more importance should be attached to the class similarity term to achieve good performance. However, when $\mu$ is set to 0, the overall performance drops dramatically, which shows that topology preservation plays an important role in KEMA and cannot be neglected. When $\mu$ is set to 1, the performance also drops, but with a smaller accuracy gap than in the case of $\mu = 0$. This means that topology preservation contributes more to the overall performance than class similarity minimization, although both terms are necessary in KEMA.

A more intuitive analysis is plotted in Figure 10. From Figure 10 we can see that good results are achieved when $\mu$ is small, and we therefore set $\mu$ to the best-performing small value in our experiments. In addition, the performance is relatively stable, with only a small accuracy gap, over this range of small $\mu$ values, which shows that our algorithm is insensitive to the trade-off parameter within this range.

4.4.4. Feature Dimension in Common Latent Space

We experiment with various feature dimensions in the common latent space to study how the feature dimension influences classification accuracy. As the hidden layer size in the aligned-to-generalized encoders is empirically set equal to the input layer size, the hidden layer size is directly determined by the feature dimension in the latent space. As illustrated in Figure 11, our proposed method reaches its best accuracy at a relatively low feature dimension (we use 100 in the experiments in Section 4.4.1), and classification accuracy then tends to decrease as the feature dimension increases. This can be explained by the decreased discriminability of the feature representations in the common latent space as its dimension increases.

4.4.5. Computation Time

We evaluate the computation time of our proposed method and report the results in Table 4. All experiments are conducted on our lab PC and implemented in MATLAB. The reported times are the running times averaged over all S_train values with T_train kept fixed, and the computation time for extracting iDTs features and performing LLC encoding is not included. The reported time covers feature alignment, feature generalization, classifier training, cross-validation, and classification. As can be seen from Table 4, our proposed method performs feature alignment, feature generalization, classifier training, and classification very efficiently; the longest computation time is only about 9 minutes, when T_train = 25. We attribute this efficiency to the PCA dimension reduction of the raw features, the fast feature alignment enabled by the kernel trick in KEMA, and the shallow single-hidden-layer neural network architecture used for feature generalization.

4.4.6. Comparison with State-of-the-Art Methods

We compare our proposed CDFAG method with the following state-of-the-art methods:
(i) Domain adaptation based methods: KEMA [15], SSMA [16], and DA [17].
(ii) Transfer learning based methods: WSCDDL [18] and Dual [19].
(iii) Deep learning based methods: two-stream CNNs [20].

Actually, we focus on transferable feature generation in this paper and only use the features generated by the comparison methods KEMA, SSMA, DA, WSCDDL, and Dual; an SVM is then used as the classifier to recognize the actions. In order to compare with these domain adaptation and transfer learning methods, we use the same experimental setting as for our proposed CDFAG (Section 4.4.1). Results are reported in Table 5. The boldface results in each column show that the proposed CDFAG is the most competitive compared with the other state-of-the-art methods. For instance, the average accuracy of the proposed CDFAG brings about 20.50%, 22.09%, 4.16%, 4.66%, and 6.17% improvements over the baseline method for the five different T_train values, respectively. This validates the effectiveness of CDFAG in improving the overall infrared action recognition accuracy with the aid of visible light data from the source domain. Compared with the domain adaptation methods, the average accuracy of CDFAG is about 6.00%, 6.59%, and 11.75% higher than that of KEMA [15], SSMA [16], and DA [17] when T_train = 25, respectively. Compared with the transfer learning methods, our proposed method also brings about 8.00% and 17.17% improvements over WSCDDL [18] and Dual [19] when T_train = 25, respectively. More comparisons are plotted in Figure 12.

The results in Figure 12 show that the proposed CDFAG performs much better than the other state-of-the-art methods on the visible-to-infrared action recognition task. As can be seen from Figure 12, the performance of Dual [19] is poor, especially with fewer target training samples, because it needs adequate training samples to be well trained. On the one hand, the other state-of-the-art methods can achieve higher accuracies than the baseline method with fewer target training samples, while, with more target training samples, the modality gap begins to dominate the recognition accuracies and their performance becomes inferior to the baseline. On the other hand, our proposed CDFAG performs well whether the number of training samples is small or large, which validates its effectiveness in reducing the modality gap across the source and target datasets. The key difference between our proposed CDFAG and the other methods is that our method takes both feature alignment and feature generalization into consideration; thus, a latent common feature space with low intraclass diversity and high interclass variance can be learnt. The other methods focus on only one of these aspects (feature alignment or generalization) without explicit knowledge transfer and effective modality gap reduction. The strong performance of our proposed CDFAG demonstrates the advantage of transferring knowledge from the visible light domain to the infrared domain and, more importantly, the efficacy of the proposed Cross-Dataset Feature Alignment and Generalization (CDFAG) framework.

We also compare the proposed CDFAG with a state-of-the-art deep learning based method. The two-stream CNN adopted in [20] achieved an average precision of 76.66% on the InfAR dataset, while the best accuracy of our proposed method is 75.42%. It is evident that our proposed CDFAG can achieve infrared action recognition performance comparable to that of the deep learning based method in [20]. In addition, the proposed CDFAG can still achieve good performance efficiently with fewer labeled training samples in the target dataset, whereas the deep learning based method in [20] is time-consuming and needs a large number of training instances or pretrained models to perform well. Therefore, our proposed CDFAG may be a good visible-to-infrared action recognition framework that strikes a good balance between classification accuracy and time efficiency.

5. Conclusion and Future Work

In this paper, we propose a novel Cross-Dataset Feature Alignment and Generalization (CDFAG) framework for visible-to-infrared action recognition. The proposed CDFAG is essentially a feature extractor that finds projections from the source and target domains into a common latent feature space where the features of all instances are semantically aligned and generalized with low intraclass diversity and high interclass variance. Promising results are achieved on visible-to-infrared action recognition and cross-dataset recognition tasks, where auxiliary source domain knowledge is effectively transferred to the target domain. Compared with several state-of-the-art transfer learning and domain adaptation based methods, our proposed CDFAG offers a more flexible framework and achieves the best performance (AP = 75.42%) in infrared action recognition. In addition, our proposed CDFAG achieves infrared action recognition performance comparable to that of a deep learning based method. In the future, we will adapt the CDFAG method to other cross-domain action recognition tasks, such as cross-view, cross-dataset, and image-to-video action recognition. Another interesting direction is to adapt existing visible light action recognition methods to infrared datasets.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (no. 61502364), the China Postdoctoral Science Foundation funded project (Grant no. 154906), and the Fundamental Research Funds for the Central Universities (Grant no. 3102016ZY022).