Abstract

Domain-transfer learning is a machine learning task that exploits a source domain data set to help with a learning problem in a target domain. Usually, the source domain has sufficient labeled data, while the target domain does not. In this paper, we propose a novel domain-transfer convolutional model that maps each target domain data sample to a proxy in the source domain and applies a source domain model to the proxy for the purpose of prediction. In our framework, we first represent both source and target domain samples as feature vectors with two convolutional neural networks and then construct a proxy for each target domain sample in the source domain space. The proxy is supposed to match the convolutional representation vector of the corresponding target domain sample well. To measure the matching quality, we propose to maximize the squared-loss mutual information (SMI) between the proxies and the target domain samples. We further develop a novel neural SMI estimator based on a parametric density ratio estimation function. Moreover, we also propose to minimize the classification error of both the source domain samples and the target domain proxies. The classification responses are also smoothed by the manifolds of both the source domain and the proxy space. By minimizing an objective function composed of the SMI, the classification error, and the manifold regularization, we learn the convolutional networks of both the source and target domains. In this way, the proxy of a target domain sample can be matched to the source domain data and thus benefit from the rich supervision information of the source domain. We design an iterative algorithm to update the parameters alternately and test it over benchmark data sets of abnormal behavior detection in video, Amazon product review sentiment analysis, etc.

1. Introduction

1.1. Background

With the rapid development of Internet technologies, more and more people are using the Internet and generating large-scale sets of behavior data [1, 2]. Nowadays, machine learning technologies are widely used for the automatic annotation of these data to extract insights for the purpose of decision-making [3, 4]. To learn machine learning models for this purpose, we need sufficient data and the corresponding labels so that a model can learn from the data-label pairs to build the mapping from the data to the labels. Moreover, the number of such data-label pairs should be large enough to cover most of the patterns in the data. However, labeling the data is sometimes expensive. When data are generated, they are usually not labeled automatically, and thus, the labels are missing. To fill this gap, human labeling work is required to provide the labels. In many cases, labeling data is time-consuming and costly. As a result, we can only label a limited amount of data and must train the model with a large amount of unlabeled data together with a small amount of labeled data [5, 6]. The lack of labeled data is a bottleneck for machine learning tasks over big data. To solve this problem, various methods have been developed, among which the most popular ones are semisupervised learning and transfer learning [7, 8].

Remark 1. The method proposed in this paper is a new transfer learning method, and it applies to many different applications. Their close relations are explained as follows.
Transfer learning aims to borrow sufficiently labeled data from another domain to help the learning of a model in a domain where labeled data are not sufficient. This problem is defined over a source domain with sufficient labeled data and a target domain with insufficient labeled data but a large set of unlabeled data. The source domain and target domain have the same data space and label space. However, the data distributions of the two domains are different; thus, it is not suitable to directly merge them to train a model for the target domain. Due to the mismatch of the data distributions of the source and target domains, it is important to transfer the knowledge from the source domain to the target domain. Transfer learning algorithms are designed for this purpose. For example, in video anomaly detection, the problem is to detect anomalies in the frames of a video, where the anomalies include cycle, skater, truck, car, wheelchair, and baby cart. In the real world, video data are usually collected from subway stations, communities, shopping centers, etc., which are treated as the target domain. Moreover, these data are usually not labeled at all, or only a small part is labeled. Meanwhile, we can use a fully labeled data set collected and annotated by the University of California, San Diego (UCSD) [9, 10] as the source domain data to help the training of the target domain model. However, the UCSD video is captured on the campus of the university, which differs from the target domain's locations, such as shopping centers and subway stations. Thus, the data of the source and target domains do not match perfectly, and the mismatch between the two domains' videos requires transferring the model trained over the source domain to the target domain so that it can adapt to the target domain data. Otherwise, the model trained over the source domain cannot fit the target domain.
Another example is Amazon product review sentiment analysis [11, 12]. This task is to detect the sentiment (positive/negative) from the text content of a product review. For different product types, buyers may have different review styles. Some product types have sufficient labeled data, while others lack labeled data. Thus, we want to use the well-labeled product reviews as a source domain to help the learning of sentiment detectors for the sparsely labeled reviews of other products. Again, domain transfer is needed to adapt the model from the source domain product to the target domain product.
In transfer learning models, the deep convolutional neural network (CNN) [13–17] is the most popular base model. It is composed of multiple convolutional layers and max-pooling layers. The convolutional layers use a filter bank to extract local features by sliding over the input data, such as an image or a sentence. This model can extract both local and hierarchical features from data and thus can perform automatic feature engineering effectively. In this paper, we study the problem of transfer learning over a source domain and a target domain. We propose a novel transfer learning method based on a convolutional neural network (CNN) [13] and use the squared-loss mutual information (SMI) [18] to measure the matching between the source and the target domains.

1.2. Existing Works

In traditional transfer learning methods, various criteria are applied to match the source and target domains, such as mutual information minimization (MIM) [19], Hilbert-Schmidt independence criterion (HSIC) maximization [20], and maximum mean discrepancy (MMD) minimization [21, 22]. Moreover, the most popular base model for training transfer learning models is the deep CNN, which is also adopted in our paper. Here, we briefly introduce some state-of-the-art works relevant to our work on CNN-based transfer learning.

Long et al. [21, 23] proposed a deep CNN model for transfer learning. This model is a two-branch network. The first four convolutional layers are shared by both the source and target domains, and the last three fully connected layers are domain-specific. However, in the last three layers, the output features are forced into the same spaces by MMD minimization, which measures the mismatch in the features of each layer between the two domains by the squared distance of their means.

Zhang et al. [14] proposed a CNN model learning method for transfer learning that represents both data features and attributes. The structure of this network is composed of an attribute-embedding CNN shared across domains, a domain-independent CNN, and a domain-dependent CNN for the embedding of the original data of each sample. The outputs of these three CNN models are concatenated as the overall features of one sample and used to predict the class label. The attribute-embedding CNN is required to approximate the transformed attributes accurately, and the domain-independent CNN outputs are supposed to minimize the MMD of the two domains.

Wang et al. [19] developed a novel CNN model learning method for both source and target domains. The CNN outputs are required to be independent of the domain; i.e., from the CNN output of a data sample, we should not be able to tell which domain it comes from. To measure the independence of the CNN outputs and the domain indicator, mutual information is applied. The CNN models are learned by minimizing the mutual information between the CNN outputs and the domain indicators to obtain domain-independent CNN representations.

Ge and Yu [24] designed a learning method of a deep CNN model for transfer learning based on source-target selective joint fine-tuning. Instead of using all the data samples of the source domain, this method uses only a subset of the source domain training samples, selected so that their low-level characteristics are similar to those of the target domain samples. To this end, the algorithm constructs descriptors from the responses of the filter bank over the training samples. These descriptors are used to search for the subset of training samples used for the learning problem.

1.3. Our Contributions

In this paper, we propose a novel CNN-based deep learning framework for transfer learning. Our work is motivated by a phenomenon: the mismatch between the source domain and the target domain is usually at the entire data set level but not at the data sample level. Moreover, it is also difficult to find a one-to-one matching between the samples of the source and target domains. Thus, we propose to construct a proxy of each target domain sample in the source domain. This proxy is constructed by a linear combination of the source domain samples. We also argue that the CNN model is a powerful tool to extract the features of both the source and target domains and an optimal choice for the proxy construction. Thus, we first extract the convolutional features from both source and target samples and then apply the proxy construction. We hope the constructed proxy of each target domain sample can match the original target sample's convolutional representation as well as possible. In this paper, we propose an information-theoretic measure of the quality of matching between the proxies and the convolutional features of the target domain samples. This measure is the squared-loss mutual information (SMI) [25, 26], which measures the Pearson divergence between the joint probability of the proxies and convolutional features and the product of their marginal probabilities. We further relax the calculation of the SMI by first designing a parametric density ratio estimator over the proxy and convolutional feature of each target domain sample and then approximating the SMI as the squared matching error of the true density ratio against the density ratio estimator.

To learn the proxy construction coefficients, we model the learning problem as a minimization problem. In this minimization problem, we minimize the SMI loss, the classification loss of the labeled samples, and the classification response entropy of the unlabeled samples of both domains. Moreover, the manifold regularization and the model complexity are also considered in the minimization problem. The contributions of this paper are listed as follows:

(1) We build a novel transfer learning schema, which first represents the source and target domains by CNN models and then constructs proxies for the target samples from the source domain. The construction is guided by the SMI-based matching measure. In this way, the matching of the two domains is directly applied at the proxy level, while the CNN representations can still keep the domain characteristics. A novel SMI approximation is proposed as the squared matching error between the density ratio function and its parametric estimator.

(2) We model the problem as a minimization problem with SMI, classification loss, classification entropy, manifold regularization, and model complexity regularization terms. This minimization is a joint learning framework that optimizes the CNN model parameters and the proxy construction coefficients simultaneously.

(3) We develop an iterative algorithm to solve the minimization problem based on the alternating direction method of multipliers (ADMM) [27]. Moreover, the parameter of the density ratio estimator of the SMI is also updated in this algorithm. With this algorithm, the SMI parameter is automatically approximated together with the other variables of the objective.

1.4. Paper Organization

This paper is organized as follows: in Section 2, we introduce the proposed transfer learning method based on CNNs and proxy learning; in Section 3, we evaluate the proposed method experimentally over several benchmark data sets; in Section 4, we give the conclusion of this paper and discuss future work.

2. Proposed Method

In this section, we propose the novel transfer convolutional neural network learning method. Firstly, we build the objective function for the learning of the source and target domain convolutional networks; then, we discuss how to minimize this objective function; finally, we develop an iterative algorithm based on the optimization results.

2.1. Objective Function

Suppose we have a training set of two domains. In the target domain, we have $n$ training samples $\{x_i^t\}_{i=1}^n$, where $x_i^t$ is the $i$-th training sample of the target domain. To represent a target domain sample $x_i^t$, we use a deep CNN model, $f$, to convert it to a $d_t$-dimensional vector:

$$f_i = f(x_i^t; \Theta) \in \mathbb{R}^{d_t},$$

where $\Theta$ denotes the filter parameters of the target domain network.

Meanwhile, we have a source domain training set $\{x_j^s\}_{j=1}^m$, which has $m$ samples, where $x_j^s$ is the $j$-th source domain training sample. Similar to the target domain, we use another deep CNN model, $g$, to convert a source domain sample $x_j^s$ to a $d_s$-dimensional vector:

$$g_j = g(x_j^s; \Phi) \in \mathbb{R}^{d_s},$$

where $\Phi$ denotes the filter parameters of the source domain network.

To learn the network parameters Θ and Φ, we discuss the following problems.

2.1.1. Squared-Loss Mutual Information

Due to the mismatch of the source and target domains, we propose to bridge the two domains by constructing a proxy for each target domain sample in the source domain space. To this end, for a target domain sample $x_i^t$, we denote its proxy as $z_i$, a $d_s$-dimensional vector. Moreover, we impose that it can be reconstructed by a linear combination of the convolutional representations of the source domain samples:

$$z_i = \sum_{j=1}^{m} a_{ij}\, g_j, \quad a_{ij} \geq 0,$$

where $g_j$ is the convolutional vector of the $j$-th source domain sample and $a_{ij}$ is the nonnegative weight of the $j$-th source domain sample for the construction of the $i$-th proxy.
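For concreteness, the following is a minimal NumPy sketch of this construction; the array names `G` and `A` are illustrative and assume the source convolutional vectors have already been computed.

```python
import numpy as np

def construct_proxies(G, A):
    """Build proxies z_i = sum_j a_ij * g_j for all target samples.

    G: (m, d_s) array whose rows are the source convolutional vectors g_j.
    A: (n, m) array of combination weights; row i holds a_i1, ..., a_im.
    Returns Z: (n, d_s) array whose rows are the proxies z_i.
    """
    A = np.maximum(A, 0.0)  # enforce the nonnegativity of the weights
    return A @ G

# Toy usage: m = 5 source samples, n = 3 target samples, d_s = 4.
G = np.random.randn(5, 4)
A = np.random.rand(3, 5)
Z = construct_proxies(G, A)  # shape (3, 4)
```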

To encourage the matching of the source and target domains, we propose to measure the quality of matching by the density ratio of the proxies and the convolutional vectors of the target domain samples, $z$ and $f$. For this purpose, we first define the density ratio of $z$ and $f$, denoted as $r(f, z)$. According to the definition of the density ratio [18],

$$r(f, z) = \frac{p(f, z)}{p(f)\,p(z)},$$

where $p(f, z)$ is the joint probability of $f$ and $z$, $p(f)$ is the probability of $f$, and $p(z)$ is the probability of $z$. Since it is difficult to estimate the density ratio directly, we propose to learn a parametric density ratio estimator function:

$$r_\phi(f, z) = \mathrm{Sigmoid}\left(\phi^\top [f; z; f - z]\right),$$

where $[f; z; f - z]$ is the concatenation of $f$, $z$, and their element-wise difference vector, $\phi$ is the parameter vector, and $\mathrm{Sigmoid}(\cdot)$ is the sigmoid activation function. To learn $\phi$, we minimize the squared matching error between $r_\phi$ and the true density ratio, which defines the SMI-based objective as follows:

$$J(\phi) = \frac{1}{2}\iint p(f)\,p(z)\left(r(f, z) - r_\phi(f, z)\right)^2 df\, dz = C - \iint p(f, z)\, r_\phi(f, z)\, df\, dz + \frac{1}{2}\iint p(f)\,p(z)\, r_\phi(f, z)^2\, df\, dz, \quad (6)$$

where $C = \frac{1}{2}\iint p(f)\,p(z)\, r(f, z)^2\, df\, dz$ does not depend on $\phi$.
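Below is a minimal sketch of this parametric estimator, assuming $d_t = d_s$ so that the element-wise difference $f - z$ is well defined; the function names are illustrative rather than taken from the paper.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def density_ratio_estimate(f, z, phi):
    """Parametric density ratio estimator r_phi(f, z).

    f, z: vectors of the same dimension d (here d_t = d_s is assumed).
    phi:  parameter vector of dimension 3 * d, matching the
          concatenated feature [f; z; f - z].
    """
    v = np.concatenate([f, z, f - z])
    return sigmoid(phi @ v)
```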

To estimate the SMI given the training samples of the target domain and their proxies, we collect the set of convolutional vectors $\{f_i\}_{i=1}^n$ and the corresponding proxies $\{z_i\}_{i=1}^n$. We assume the distributions $p(f, z)$, $p(f)$, and $p(z)$ are uniform over the training samples, which leads to the empirical probability functions

$$p(f_i, z_i) = \frac{1}{n}, \quad p(f_i) = \frac{1}{n}, \quad p(z_j) = \frac{1}{n}. \quad (7)$$

With the probabilities in (7) and the empirical approximation, we rewrite the second and third terms of (6) as follows:

$$\iint p(f, z)\, r_\phi(f, z)\, df\, dz \approx \frac{1}{n}\sum_{i=1}^{n} r_\phi(f_i, z_i), \quad \frac{1}{2}\iint p(f)\,p(z)\, r_\phi(f, z)^2\, df\, dz \approx \frac{1}{2n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} r_\phi(f_i, z_j)^2. \quad (8)$$

By substituting (8) into (6), we have the approximated objective

$$J(\phi) \approx C - \frac{1}{n}\sum_{i=1}^{n} r_\phi(f_i, z_i) + \frac{1}{2n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} r_\phi(f_i, z_j)^2,$$

where $C$ is a constant. To obtain a good-quality estimator, we propose to learn the parameter $\phi$ by minimizing this approximated squared matching error:

$$\min_{\phi}\left\{-\frac{1}{n}\sum_{i=1}^{n} r_\phi(f_i, z_i) + \frac{1}{2n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} r_\phi(f_i, z_j)^2\right\}. \quad (10)$$
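Reusing the `density_ratio_estimate` sketch above, the $\phi$-dependent part of this objective can be computed empirically as follows (the constant $C$ is dropped, since it does not affect the minimization):

```python
import numpy as np

def smi_phi_objective(F, Z, phi):
    """phi-dependent part of the approximated objective (C is dropped).

    F, Z: (n, d) arrays of target convolutional vectors and their proxies.
    """
    n = F.shape[0]
    # R[i, j] = r_phi(f_i, z_j) over all sample pairs.
    R = np.array([[density_ratio_estimate(F[i], Z[j], phi)
                   for j in range(n)] for i in range(n)])
    joint_term = np.trace(R) / n                 # (1/n) sum_i r(f_i, z_i)
    indep_term = (R ** 2).sum() / (2.0 * n * n)  # (1/2n^2) sum_ij r(f_i, z_j)^2
    return -joint_term + indep_term
```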

With the optimal $\phi$, we can use the SMI as a measure of the matching quality between $f$ and $z$. Using this measure as a loss term for the purpose of learning the convolutional representations and proxies, we organize these parameters as matrices $F = [f_1, \ldots, f_n]$ and $Z = [z_1, \ldots, z_n]$. With these variables and the SMI as the matching measure, we seek to learn them to maximize the SMI. In other words, we minimize a loss function derived from the SMI, denoted as $\mathcal{L}_{\mathrm{SMI}}(F, Z)$ and defined as the negative of the approximated SMI.

We rewrite this loss compactly in matrix form, with the matrices defined accordingly.

The squared-loss mutual information regularization problem is modeled as

$$\min_{F, Z}\ \mathcal{L}_{\mathrm{SMI}}(F, Z). \quad (13)$$

By minimizing this objective, we seek to learn the convolutional representation vectors of both source and target domain samples, target domain sample proxies, and other parameters to match the two domains as much as possible. Please note that, in this problem, we are not matching the source and target domain samples directly but seek to match target domain samples and the proxies of target domain samples in the source domain space.

2.1.2. Semisupervised Classification

In both the source and target domains, there are labeled and unlabeled data samples. To learn from them, we use a semisupervised method. We discuss the semisupervised learning problems in the source and target domains as follows.

Source Domain. To approximate the class label vector $y_j^s$ of a source domain sample $x_j^s$ from its convolutional feature vector $g_j$, we design a linear classification function:

$$h^s(x_j^s) = U^\top g_j,$$

where $U = [u_1, \ldots, u_c]$ is the parameter matrix of the classifier and $h^s(x_j^s)$ is the classification response. To learn the classifier parameter $U$, we utilize both the labeled and unlabeled samples of the source domain. We denote the set of labeled samples of the source domain as $\mathcal{L}^s$ and the set of unlabeled samples as $\mathcal{U}^s$. For the samples in $\mathcal{L}^s$, we minimize their classification errors measured by the squared loss:

$$\min_U \sum_{j \in \mathcal{L}^s} \left\| U^\top g_j - y_j^s \right\|_2^2,$$

where $y_j^s$ is the class label vector of $x_j^s$. For the samples in $\mathcal{U}^s$, even though we do not have labels available to constrain the classification responses, we can still regularize the learning of the parameter $U$ by the response entropy. For a sample $x_j^s \in \mathcal{U}^s$, we hope the uncertainty of its classification responses is as small as possible. The uncertainty is measured by the entropy

$$H_j^s = -\sum_{k=1}^{c} \left(u_k^\top g_j\right) \log\left(u_k^\top g_j\right),$$

where $u_k^\top g_j$ is the $k$-th dimension of the response and $u_k$ is the $k$-th column of $U$. We minimize the entropy of the source samples in $\mathcal{U}^s$ as follows:

$$\min_U \sum_{j \in \mathcal{U}^s} H_j^s.$$
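The following sketch illustrates the labeled squared loss plus the unlabeled entropy term; the softmax normalization of the responses is an assumption added here so that the entropy is well defined, not a detail specified by the paper.

```python
import numpy as np

def source_domain_loss(G_lab, Y_lab, G_unlab, U, eps=1e-12):
    """Labeled squared loss plus unlabeled response entropy.

    G_lab:   (l, d_s) convolutional vectors of labeled source samples.
    Y_lab:   (l, c) one-hot class label vectors.
    G_unlab: (u, d_s) convolutional vectors of unlabeled source samples.
    U:       (d_s, c) classifier parameter matrix.
    """
    # Squared classification loss over the labeled set.
    sq_loss = ((G_lab @ U - Y_lab) ** 2).sum()
    # Entropy of the unlabeled responses; softmax keeps them in (0, 1)
    # (an assumption made here so the entropy is well defined).
    E = np.exp(G_unlab @ U)
    P = E / E.sum(axis=1, keepdims=True)
    entropy = -(P * np.log(P + eps)).sum()
    return sq_loss + entropy
```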

Target Domain. In the target domain, we also have a labeled set, denoted as $\mathcal{L}^t$, and an unlabeled set, denoted as $\mathcal{U}^t$. To use the labeled samples of the target domain, instead of learning a classifier in the target domain space, we design a classifier in the target domain sample proxy space. For a proxy $z_i$, we apply a linear classifier function to approximate its label vector:

$$h^t(z_i) = V^\top z_i,$$

where $V = [v_1, \ldots, v_c]$ is the classifier parameter matrix and $h^t(z_i)$ is the estimated class label response vector. For a labeled sample $x_i^t \in \mathcal{L}^t$, its class label vector $y_i^t$ is known, and we minimize the classification loss measured by the squared loss between $V^\top z_i$ and $y_i^t$:

$$\min_V \sum_{i \in \mathcal{L}^t} \left\| V^\top z_i - y_i^t \right\|_2^2.$$

Meanwhile, for the samples in the unlabeled target domain set $\mathcal{U}^t$, we also hope their uncertainty can be minimized, and accordingly, we minimize the entropy of the classification responses:

$$\min_V \sum_{i \in \mathcal{U}^t} \left(-\sum_{k=1}^{c} \left(v_k^\top z_i\right) \log\left(v_k^\top z_i\right)\right).$$

2.1.3. Manifold Regularization

Moreover, we also hope the classification responses of samples in both source and target domains can be smooth over their manifold. To be specific, the classification responses of two neighboring samples should also be similar. To this end, we first construct the nearest neighbor graphs for both the source and target domains and then use the graphs to regularize their classification responses.

Source Domain. We firstly build a graph from the convolutional representations of the source domain samples $\{g_j\}_{j=1}^m$. For each sample $g_j$, its nearest neighbor set is denoted as $\mathcal{N}_j$, and we calculate the affinity between $g_j$ and each neighbor $g_{j'} \in \mathcal{N}_j$ based on a Gaussian kernel:

$$w_{jj'} = \exp\left(-\frac{\left\|g_j - g_{j'}\right\|_2^2}{2\sigma^2}\right),$$

where $\exp(-\|\cdot\|_2^2/2\sigma^2)$ is the Gaussian kernel function with bandwidth $\sigma$. Based on the neighborhood affinity, we hope the squared norm distance between the classification response of $g_j$ and those of its neighbors can be minimized so that the neighborhood structure can be kept:

$$\min_U \sum_{j=1}^{m} \sum_{j' \in \mathcal{N}_j} w_{jj'} \left\| U^\top g_j - U^\top g_{j'} \right\|_2^2.$$

Target Domain. In the target domain, we firstly construct the graph from the original convolutional representations of the target samples $\{f_i\}_{i=1}^n$, not the proxies. Given a target sample $f_i$, we find its nearest neighbors and denote the set of nearest neighbors as $\mathcal{N}_i$. We calculate the similarity between $f_i$ and a neighbor $f_{i'} \in \mathcal{N}_i$ also by the Gaussian kernel function:

$$w_{ii'} = \exp\left(-\frac{\left\|f_i - f_{i'}\right\|_2^2}{2\sigma^2}\right).$$

With this affinity measure, different from the source domain, we use it to regularize the learning of the proxies of the target domain, $\{z_i\}_{i=1}^n$, i.e.,

$$\min_Z \sum_{i=1}^{n} \sum_{i' \in \mathcal{N}_i} w_{ii'} \left\| z_i - z_{i'} \right\|_2^2.$$
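A minimal sketch of the graph construction and smoothness penalty is shown below; applied to the proxies it gives the target domain regularizer, and applied to the classification responses it gives the source domain one. The function names and the dense-matrix implementation are illustrative.

```python
import numpy as np

def knn_affinity(X, k, sigma):
    """Gaussian affinities between each sample and its k nearest neighbors.

    X: (n, d) array of representations. Returns an (n, n) array with
    w_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)) for j in N_i and 0 elsewhere.
    """
    n = X.shape[0]
    D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise sq. distances
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(D[i])[1:k + 1]  # skip the sample itself
        W[i, nbrs] = np.exp(-D[i, nbrs] / (2.0 * sigma ** 2))
    return W

def smoothness_penalty(W, Z):
    """sum_i sum_{i' in N_i} w_ii' * ||z_i - z_i'||^2."""
    n = Z.shape[0]
    return sum(W[i, j] * ((Z[i] - Z[j]) ** 2).sum()
               for i in range(n) for j in range(n) if W[i, j] > 0)
```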

2.1.4. Overall Objective Function

Our overall objective is the combination of the above objectives:

$$\mathcal{O} = \mathcal{L}_{\mathrm{SMI}} + \lambda_1 \mathcal{L}_{\mathrm{cls}}^s + \lambda_2 \mathcal{L}_{\mathrm{ent}}^s + \lambda_3 \mathcal{L}_{\mathrm{cls}}^t + \lambda_4 \mathcal{L}_{\mathrm{ent}}^t + \lambda_5 \mathcal{R}_{\mathrm{man}}^s + \lambda_6 \mathcal{R}_{\mathrm{man}}^t + \lambda_7 \left(\|U\|_F^2 + \|V\|_F^2\right),$$

where $\lambda_1, \ldots, \lambda_7$ are the tradeoff parameters, which weight the loss terms of the objective. The last term is the squared norm of the model parameters, which prevents the overfitting problem. In this objective, we impose the following conditions:

(1) In the target domain, we hope the SMI between the samples and their proxies constructed from the source domain can be maximized so that the source and target domains can be matched well.

(2) In both the target and source domains, we hope the classification function can approximate the ground-truth labels of the labeled samples well, while the uncertainty of the classification results of the unlabeled samples is minimized.

(3) Again, in both the source and target domains, we hope the classification results can be similar among neighborhoods.

(4) Finally, we hope the overall complexity of the model can be kept as small as possible; thus, we minimize the squared norm of the model parameters in the last term.

A sketch of this weighted combination is given after this list.
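As a small illustration of how the tradeoff parameters weight the terms (the individual term values would come from functions like those sketched earlier; all names are illustrative):

```python
def overall_objective(smi_loss, terms, lambdas):
    """Weighted combination of the loss terms.

    terms:   the seven remaining terms, in order (source/target
             classification losses, entropies, manifold penalties,
             and the squared-norm regularizer).
    lambdas: the seven tradeoff parameters weighting those terms.
    """
    return smi_loss + sum(lam * t for lam, t in zip(lambdas, terms))
```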

With this objective, we model the learning problem as a minimization problem:

$$\min_{F, Z, \Theta, \Phi, U, V, A}\ \mathcal{O}, \quad \text{s.t. the proxy construction and nonnegativity constraints above}. \quad (27)$$

By solving this problem, we can learn the source and target CNN models $g$ and $f$ and the classification layer parameters $U$ and $V$. Moreover, the proxy parameters of the target samples, $A$, are also optimized together with the model parameters to match the source and target domains.

2.2. Problem Optimization

To solve the problem in (27), we adopt the ADMM algorithm, and accordingly, we have the following augmented Lagrangian function, where the following holds: (i) each equality constraint of the problem is associated with a dual variable; (ii) the primal variables and dual variables are collected into matrices accordingly; (iii) each constraint has a corresponding penalty parameter.

According to the ADMM algorithm, we optimize the parameters and the dual variables iteratively as follows.
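The overall alternating scheme can be summarized by a generic driver like the following sketch; the per-variable update functions stand for the steps derived below and must be supplied by the caller, so this is a schematic skeleton rather than the paper's implementation.

```python
def alternating_admm(variables, updates, n_iters=50):
    """Generic alternating (ADMM-style) optimization driver.

    variables: dict mapping names ('F', 'G', 'Z', 'Theta', 'Phi', 'U',
               'V', 'A', 'duals', 'phi') to their current values.
    updates:   dict mapping the same names to update functions; each one
               takes the full variable dict and returns the new value
               (the per-variable steps derived in this section).
    """
    order = ['F', 'G', 'Z', 'Theta', 'Phi', 'U', 'V', 'A', 'duals', 'phi']
    for _ in range(n_iters):
        for name in order:
            variables[name] = updates[name](variables)
    return variables
```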

Optimizing F. To optimize the target convolutional vectors $\{f_i\}_{i=1}^n$, we update them one by one. We denote the subgradient of the augmented Lagrangian with regard to $f_i$ as $\nabla_{f_i}$.

The subgradient descent step to update $f_i$ is

$$f_i \leftarrow f_i - \eta\, \nabla_{f_i},$$

where $\eta$ is the descent step size.

Optimizing G. Similar to $F$, we optimize the source convolutional vectors $\{g_j\}_{j=1}^m$ one by one, and the analogous subgradient descent step is used to update $g_j$, where the indicator function $\mathbb{1}(\pi)$ equals $1$ if the condition $\pi$ is true and $0$ otherwise.

Optimizing Z. We use the subgradient descent algorithm to update the proxies $\{z_i\}_{i=1}^n$ one by one, and the updating rule is

$$z_i \leftarrow z_i - \eta\, \nabla_{z_i}.$$

Optimizing Θ. The gradient descent algorithm is applied to update the target network parameters $\Theta$:

$$\Theta \leftarrow \Theta - \eta\, \nabla_\Theta,$$

where $\nabla_\Theta$ is computed through the gradient function of the convolutional network with regard to the network parameters, usually the convolutional filters.

Optimizing Φ. We use the gradient descent algorithm to optimize the source network parameters $\Phi$ to minimize the augmented Lagrangian:

$$\Phi \leftarrow \Phi - \eta\, \nabla_\Phi.$$

Optimizing U. To optimize the classifier parameter $U$ of the source domain, we update its columns one by one. To optimize the $k$-th column $u_k$, we use the following subgradient descent step:

$$u_k \leftarrow u_k - \eta\, \nabla_{u_k}.$$

Optimizing V. To optimize $V$, we also update the columns one by one. To update the $k$-th column $v_k$, we have the following subgradient descent step:

$$v_k \leftarrow v_k - \eta\, \nabla_{v_k}.$$

Optimizing A. The optimization of the proxy construction coefficient matrix $A$ is a constrained minimization problem, solved column by column as follows.

Again, we optimize the columns of $A$ one by one. To optimize the $i$-th column $\alpha_i$, we have the following subproblem:

$$\min_{\alpha_i}\ \frac{1}{2}\alpha_i^\top Q\, \alpha_i + q^\top \alpha_i \quad \text{s.t.}\quad \mathbf{1}^\top \alpha_i = 1,\ \alpha_i \geq \mathbf{0},$$

where $\mathbf{1}$ is an all-one vector, $\mathbf{0}$ is an all-zero vector, and $Q$ and $q$ collect the quadratic and linear coefficients of the subproblem. This is a typical quadratic programming (QP) problem, and we use the active set algorithm to solve it.
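As a sketch, the per-column QP can also be handled by an off-the-shelf solver; the following uses SciPy's SLSQP in place of the active set algorithm, with `Q` and `q` standing for the (unspecified here) quadratic and linear coefficients of the subproblem.

```python
import numpy as np
from scipy.optimize import minimize

def solve_proxy_weights(Q, q):
    """min_a 0.5 a^T Q a + q^T a  s.t.  1^T a = 1, a >= 0.

    Q: (m, m) positive semidefinite matrix; q: (m,) vector.
    SLSQP is used here as a stand-in for the active set algorithm.
    """
    m = q.shape[0]
    a0 = np.full(m, 1.0 / m)  # feasible starting point on the simplex
    res = minimize(
        fun=lambda a: 0.5 * a @ Q @ a + q @ a,
        jac=lambda a: Q @ a + q,
        x0=a0,
        bounds=[(0.0, None)] * m,                              # a >= 0
        constraints=[{'type': 'eq', 'fun': lambda a: a.sum() - 1.0}],
        method='SLSQP',
    )
    return res.x
```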

Optimizing the Dual Variables. To optimize the dual variables, we have a maximization problem over the residuals of the constraints, where $\mathrm{tr}(\cdot)$ denotes the trace of a matrix. To solve this problem, we use the gradient-ascent algorithm; the updating rule for each dual variable adds a step proportional to the residual of its constraint, where $\eta$ is the ascent step size.

Besides solving the minimization problem, we also need to solve for the parameter $\phi$ of the SMI term in (13). To learn $\phi$, according to (10), we have the following minimization problem:

$$\min_{\phi}\left\{-\frac{1}{n}\sum_{i=1}^{n} r_\phi(f_i, z_i) + \frac{1}{2n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} r_\phi(f_i, z_j)^2\right\}.$$

The gradient of this objective with regard to $\phi$ is, with $v_{ij} = [f_i; z_j; f_i - z_j]$ and $r_{ij} = \mathrm{Sigmoid}(\phi^\top v_{ij})$,

$$\nabla_\phi = -\frac{1}{n}\sum_{i=1}^{n} r_{ii}\left(1 - r_{ii}\right) v_{ii} + \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} r_{ij}^2\left(1 - r_{ij}\right) v_{ij}.$$

We use the gradient descent algorithm to update $\phi$ to minimize this objective:

$$\phi \leftarrow \phi - \eta\, \nabla_\phi.$$
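Building on the estimator sketches above, the gradient and the descent step for $\phi$ can be written as follows (again a sketch under the assumed sigmoid form of $r_\phi$):

```python
import numpy as np

def phi_gradient(F, Z, phi):
    """Gradient of the empirical phi objective (see smi_phi_objective)."""
    n = F.shape[0]
    grad = np.zeros_like(phi)
    for i in range(n):
        for j in range(n):
            v = np.concatenate([F[i], Z[j], F[i] - Z[j]])
            r = sigmoid(phi @ v)
            dr = r * (1.0 - r) * v        # d r_phi / d phi for the sigmoid
            grad += (r / n ** 2) * dr     # from the squared (independent) term
            if i == j:
                grad -= dr / n            # from the joint term
    return grad

def update_phi(F, Z, phi, eta=0.1):
    """One gradient descent step on phi."""
    return phi - eta * phi_gradient(F, Z, phi)
```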

3. Experiments

In this section, we experimentally evaluate the proposed algorithm over benchmark data sets.

3.1. Benchmark Data Sets

In our experiments, we use the following benchmark data sets:

(i) Amazon review data set. This data set has four domains corresponding to four product types, including books, DVDs, and music. For each domain, there are 2,000 positive reviews and 2,000 negative reviews. To fit the content of a review to the CNN, we first tokenize the review into a sequence of words, then represent each word by a word embedding, and use a 1D-CNN as the CNN model.

(ii) UCSD anomaly detection data set. This data set has 70 video sequences, which contain 13,900 frames, and is used as the source domain. Meanwhile, we also collected 122 video sequences from subway stations, malls, and communities in China as the target domain. The target domain has 24,400 frames. Both the target and source domain data sets have 6 classes of anomalies.

(iii) TRECVID video concept detection data set. This data set has two domains. One domain is the data of TRECVID 2005, which has 61,901 frames extracted from video, while the other domain is the data of TRECVID 2007, which has 21,532 frames. The classification problem is to categorize one frame into one of 36 semantic concepts.

(iv) Protein subcellular localization data set. The last data set is a protein data set composed of three domains. The first domain is the MultiLoc data set, which has 5,859 proteins; the second is the BaCelLoc data set, with 4,286 proteins; and the last, Euk-mPLoc, has 5,618 proteins. The problem is to predict the subcellular locations of each protein from its amino acid sequence. To this end, we first map each amino acid to an embedding and then use a 1D-CNN model to extract the features.

The statistics of the data sets are summarized in Table 1.

3.2. Experimental Setting

To perform the experiments, we design a leave-one-domain-out protocol that uses each domain as the target domain in turn, while the other domains are combined into one single source domain. With each target-source configuration, we use the 10-fold cross-validation protocol to perform the training-testing process. The target domain data set is split into ten folds; one fold is used as the test set, and the remaining nine folds are combined with the source domain as the training set. The model is learned over the training set and tested over the test set. To measure the performance of the method, we calculate the average classification accuracy over the different target domains.
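The evaluation protocol can be summarized by the following sketch; `train_model` and `accuracy` are hypothetical placeholders for the learning algorithm and the evaluation metric.

```python
import numpy as np

def leave_one_domain_out(domains, train_model, accuracy, n_folds=10):
    """For each target domain, merge the rest into one source domain and
    run 10-fold cross-validation over the target data."""
    scores = {}
    for name, target in domains.items():
        source = [d for k, d in domains.items() if k != name]
        folds = np.array_split(np.random.permutation(len(target)), n_folds)
        accs = []
        for f in range(n_folds):
            train_idx = np.concatenate(
                [folds[g] for g in range(n_folds) if g != f])
            model = train_model(source, [target[i] for i in train_idx])
            accs.append(accuracy(model, [target[i] for i in folds[f]]))
        scores[name] = float(np.mean(accs))
    return scores
```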

3.3. Experimental Results

In our experiments, we firstly study the properties of the proposed method experimentally and then compare it to the other state-of-the-art methods.

3.4. Convergence Analysis

Since the proposed algorithm is iterative, we are interested in its convergence. We plot the curves of the objective value against the iteration number in Figure 1. From this figure, we can see that, for all four benchmark data sets, the algorithm converges well within 50 iterations. For the protein data set, the objective value keeps decreasing slowly but cannot reach a much lower value after 50 iterations. The possible reason is that the algorithm finds a local minimum of the objective instead of the global minimum. The overall conclusion is that, for all the tested data sets, the algorithm converges within a certain number of iterations, which is due to the effective alternating optimization method.

3.5. Tradeoff Parameter Analysis

We have seven tradeoff parameters in the objective, which control the weights of the different objective terms. We study how the performance of the algorithm changes with these tradeoff parameters. The classification accuracies with different tradeoff parameter values are shown in Figure 2.

From this figure, we have the following observations:

(i) Increasing the tradeoff parameters of the two classification error terms improves the performance significantly, since these weight the supervision terms of the labeled data. This means the supervision of the labeled samples plays an important role in learning predictive models in the transfer learning problem.

(ii) Increasing the tradeoff parameters of the two entropy terms also boosts the performance; however, the effect is not as significant as that of the classification error terms' tradeoff parameters. The reduction of the entropy of the unlabeled samples' classification responses from both domains also improves the performance.

(iii) Increasing the tradeoff parameters of the two domains' manifold regularization terms also improves the accuracy of the target domain prediction. This means neighborhood smoothness is also important for the transfer learning problem.

(iv) The performance is stable to changes in the tradeoff parameter of the squared norm term, which means this term does not play a critical role in the learning problem.

3.6. Comparison to Other Methods

We compare our proposed algorithm against several state-of-the-art CNN-based transfer learning methods, including the methods developed by Haque et al. [11, 12], Zhang et al. [14], Long et al. [21], and Pan and Yang [7]. The comparison results are shown in Figure 3.

From this figure, we can see that the proposed method outperforms the other methods in all cases. The second best method is that of Long et al. The results show the advantage of our method, especially the SMI-guided domain-transfer CNN learning and the proxy mapping mechanism. Among the four data sets, the most difficult one is TRECVID, where the classification accuracies of all methods are below 0.75. However, on this data set, our method gives the most significant improvement over the other methods; it is the only method that achieves an accuracy higher than 0.72.

4. Conclusions

In this paper, we proposed a novel CNN-based transfer learning method. We construct a proxy for each target domain sample in the source domain space and use the SMI as a measure to match the constructed proxy and the convolutional representation of the target domain sample. The proxy construction parameters and the CNN parameters are learned simultaneously by a unified learning framework. The estimation and parameter learning of the SMI are also driven by the proxies and the CNN outputs. Thus, this framework can optimize the CNN, proxy, and SMI parameters jointly and automatically. Experimental results show the advantages of the proposed framework and algorithm.

Data Availability

All the data sets used in this paper to produce the experimental results are publicly accessible online.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the work reported in this paper.

Acknowledgments

This work was supported by the Beijing Natural Science Foundation (no. 9194027), National Key R&D Program of China (Grant no. 2018YFC0704800), 2018-2019 Excellent Talent Program, Xicheng District, Beijing, and National Natural Science Foundation of China (no. 71904095).