Abstract

Domain transfer learning aims to learn common data representations from a source domain and a target domain so that the source domain data can help the classification of the target domain. Conventional transfer representation learning forces the distributions of the source and target domain representations to be similar, which relies heavily on how the domain distributions are characterized and on the distribution matching criteria. In this paper, we propose a novel framework for domain transfer representation learning. Our motivation is to make the learned representations of data points independent of the domains they belong to; in other words, from an optimal cross-domain representation of a data point, it is difficult to tell which domain it comes from. In this way, the learned representations can be generalized to different domains. To measure the dependence between the representations and the domains the data points belong to, we propose to use the mutual information between the representations and the domain-belonging indicators. By minimizing this mutual information, we learn representations which are independent of domains. We build a classwise deep convolutional network as the representation model and maximize the margin of each data point of the corresponding class, where the margin is defined over intraclass and interclass neighborhoods. To learn the parameters of the model, we construct a unified minimization problem in which the margins are maximized while the representation-domain mutual information is minimized. In this way, we learn representations which are not only discriminative but also independent of domains. An iterative algorithm based on the Adam optimization method is proposed to solve the minimization problem and to learn the classwise deep model parameters and the cross-domain representations simultaneously. Extensive experiments over benchmark datasets show the effectiveness of the proposed method and its advantage over existing domain transfer learning methods.

1. Introduction

1.1. Background

Transfer learning is a machine learning problem which deals with data from two domains [16]. One domain is the target domain, in which we aim to learn an effective machine learning model for prediction. The other is the source domain, in which we have sufficient labeled data points. Usually, the target domain contains only a small number of labeled data points, which is not sufficient to learn an effective model. Domain transfer learning therefore tries to transfer the knowledge in the source domain to the target domain to help the learning in the target domain. Although the target domain and source domain share the same input and output spaces, the distributions of the input data points of the two domains differ significantly. For example, in the problem of text topic categorization, newspaper articles form a source domain, where almost all articles are well labeled, while personal communication messages form a target domain, where the message texts are unlabeled or only a small number of them are labeled. It is natural to use the newspaper articles and their labels to help learn a model for categorizing the message texts. However, newspaper articles are written formally, while personal messages are usually written casually, so the word usage and writing styles are very different. This leads to a significant difference between the distributions of the source domain (newspaper articles) and the target domain (personal messages). Transfer learning aims to build a predictive model for the target domain by utilizing the data points of both domains, even though they follow different distributions.

In this case, it is necessary to map the data points of both domains to a common data space so that they follow the same distribution, and we can directly train a model for the target domain using the representations of both domains' data points. Another solution is to learn a model for the source domain first and then adapt it to the target domain. In this paper, we focus on the first solution, where the data points are mapped to a common space. This solution aims to learn domain-transferable representations for data points in different domains. Different representation learning methods have been applied for domain-transferable representation learning, including multikernel learning [7–10], deep learning [11–20], nonnegative matrix factorization [21–24], sparse coding [25–28], etc. For domain distribution matching-based domain transfer learning, the most popular method is based on the maximum mean discrepancy (MMD) criterion. It calculates the means of the representations of the data points of the source and target domains and minimizes the squared norm distance between them to match the two domains.
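
As a point of reference, the MMD criterion just described reduces, in its simplest linear form, to the squared Euclidean distance between the two domain means. The sketch below is a minimal illustration of that quantity; the function and variable names are ours, not taken from any particular paper or library.

```python
import numpy as np

def mmd_linear(source_repr: np.ndarray, target_repr: np.ndarray) -> float:
    """Squared Euclidean distance between the mean representations
    of the source and target domains (the simplest form of MMD)."""
    mean_gap = source_repr.mean(axis=0) - target_repr.mean(axis=0)
    return float(mean_gap @ mean_gap)

# Example: two domains drawn from shifted Gaussians give a nonzero MMD.
rng = np.random.default_rng(0)
src = rng.normal(0.0, 1.0, size=(100, 16))   # source-domain representations
tgt = rng.normal(0.5, 1.0, size=(120, 16))   # target-domain representations
print(mmd_linear(src, tgt))
```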

In this paper, we study the problem of learning domain transfer representations. However, instead of matching the distributions of the two domains, we learn representations that can be directly generalized to both domains.

1.2. Related Works

In this section, we briefly introduce the state-of-the-art methods for transferable representation learning.

The authors in [4] present a novel method to learn deep networks for domain adaptation. The proposed method maps the outputs of all layers of the deep network to reproducing kernel Hilbert spaces and tries to match the distributions of these layers' outputs between the target and source domains. Moreover, the kernel space mapping is conducted by applying multikernel learning, where the optimal kernel function is a weighted linear combination of multiple kernels. This is different from conventional transfer learning methods, which only match the distributions of the outputs of the last layer of the source and target domains. The mismatch between the two distributions is measured by the maximum mean discrepancy criterion, which minimizes the squared Euclidean distance between the mean outputs of the source and target domains at the corresponding layer.

The authors in [6] proposed to learn a domain transfer learning model for domains with independent feature spaces. In this case, the source domain and target domain have completely different feature spaces. The source domain data points are mapped to target domain data points, and the mapping guarantees that any source domain data point is mapped to a target domain point with the same class label. The target domain and the source domain (mapped to the target domain) are both represented by kernel matrices. To measure how well the two domains are aligned, the Hilbert-Schmidt Independence Criterion is applied. It calculates the trace of the product of the target domain kernel matrix and the mapped source domain kernel matrix. By maximizing this trace, the distributions of the source and target domains are aligned and matched.

The authors in [1] proposed a novel method for transfer learning. It selects data points from the source domain for the target domain learning problem. To be specific, it assigns a weight to each source domain data point, which plays the role of a selective weight. This weight has two functions. The first is to select important source domain data points to represent the source domain when matching it to the target domain. Instead of calculating the mean vector of the source domain features, this method calculates the weighted mean and matches it to the target domain by the maximum mean discrepancy criterion. The second is to weight the loss function of the target domain. The data points' weights and the classifier parameters are learned simultaneously in an iterative algorithm.

The authors in [2] developed a novel multikernel classifier for domain transfer learning. It constructs the kernel function by multikernel combination with learned weights, and the kernel weights and classifier parameters are learned simultaneously. To match the source domain and target domain, the data points of the two domains are mapped to a nonlinear Hilbert space, and their distributions are matched in this space. The learning algorithm minimizes the classification losses over the labeled data points of both domains and the squared Euclidean distance between the mean multikernel representations of the two domains under the maximum mean discrepancy criterion.
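
To make the trace-based alignment criterion mentioned for [6] concrete, the following is a generic sketch of a Hilbert-Schmidt Independence Criterion computation between two kernel matrices. It uses a standard RBF kernel and the usual centering matrix; it illustrates the general criterion, not the exact formulation used in [6].

```python
import numpy as np

def rbf_kernel(X: np.ndarray, gamma: float = 1.0) -> np.ndarray:
    """Standard Gaussian (RBF) kernel matrix of the rows of X."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * d2)

def hsic(K: np.ndarray, L: np.ndarray) -> float:
    """Empirical HSIC: trace of the product of two centered kernel matrices."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    return float(np.trace(K @ H @ L @ H) / (n - 1) ** 2)

# Example with two views of the same n points.
rng = np.random.default_rng(0)
A, B = rng.normal(size=(50, 8)), rng.normal(size=(50, 8))
print(hsic(rbf_kernel(A), rbf_kernel(B)))
```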

1.3. Our Contributions
1.3.1. Motivation

All the above methods are based on matching the two domains' distributions of the data representations. The two key components of such methods are the characterization of the distributions and the metric of the mismatch between the two distributions. In this paper, we give up this framework and propose a completely different framework for domain transfer learning. We observe that, for an ideal representation model across two different domains, we cannot tell which domain a data point comes from by looking at its output, while at the same time we can separate its true class from the other classes according to the output of the cross-domain representation model. This means that the representation of a data point is independent of its domain but closely related to its class. Thus, instead of measuring the mismatch between the source and target domain distributions, we measure the dependence between the representations and the domain-belonging indicators of the data points. To quantify this dependence, we employ the mutual information between the representation and the domain indicator. By minimizing this mutual information, we learn domain-independent representations. Meanwhile, we also propose to maximize the margin of each data point so that it is separated from data points of other classes and kept close to data points of the same class.

1.3.2. Our Method

Motivated by the above ideas, we propose a novel deep learning model for the representation of data points in transfer learning problems. Firstly, to enhance the ability to discriminate data points of different classes, we propose to learn a unique deep convolutional network for each class, named the classwise convolutional representation model. This is different from traditional domain transfer representation models, which learn a common model for all classes. To make the outputs of this model independent of the domain indicators, we propose to minimize the mutual information between the representation model outputs and the domain indicators. The mutual information estimate is based on the probability of the representations and the conditional probability of the domain indicators given the representations. We develop novel estimators for the conditional probability of a domain indicator given a representation. The estimator is defined over the neighborhood of the data point of the given representation, and it calculates the normalized summation of the soft weights of the neighboring data points that come from the domain in question. To make the outputs of the model discriminative, we propose to maximize the margin of each data point in the corresponding class. The margin is defined as the difference between the intraclass dissimilarity and the interclass dissimilarity. The intraclass dissimilarity is defined over an intraclass neighborhood, which contains a set of neighboring data points from the same class, while the interclass dissimilarity is defined over an interclass neighborhood, which contains a set of neighboring data points from the other classes. To learn the representation model parameters, we build a unified learning framework. The objective function is defined by combining the margins, the mutual information, and a squared norm term that controls the complexity of the model. An iterative algorithm based on Adam is developed to solve the problem.

Remark. The overall diagram of the proposed learning framework for each class is given in Figure 1. As we can see from the figure, for each CNN model, its outputs are regularized by two types of auxiliary information: the domain indicator and the class label. Our framework calculates the mutual information between the CNN representations and the domain indicators and minimizes it. Meanwhile, it calculates the margin from the class labels and maximizes it. In this way, the framework makes the CNN model discriminative and insensitive to the differences between domains.
Our contributions are threefold:
(1) For the first time, the idea of learning cross-domain representations which are independent of domains is proposed for transfer learning. Instead of learning representations and making the distributions of the two domains' representations match each other, we directly learn representations which are independent of their domain-belonging indicators. The mutual information is used to measure the dependence between representations and domain indicators, and it is minimized to seek domain-independent representations.
(2) We develop a novel and practical representation learning method to minimize the mutual information between the data points' representations and domain indicators. The mutual information is estimated from the probability of a representation and the conditional probability of a domain indicator given a representation. We estimate the conditional probability of a data point's domain indicator given its representation over its neighborhood; it is calculated as a summation of the normalized Gaussian-kernel-based similarity measures of the data points in the neighborhood that come from the domain under consideration.
(3) We propose a novel transfer learning framework for learning domain transfer deep representation models. It is a classwise model, and we learn its parameters by simultaneously maximizing the margin of each data point of the class and minimizing the mutual information between the data points' representations and their domain indicators. An iterative algorithm is developed to learn the optimal representations and the parameters of the model that outputs these representations.

1.4. Paper Organization

The paper is organized as follows: in Section 2, we introduce the proposed method in detail, including its mathematical modeling, problem optimization, and iterative algorithm design. In Section 3, we evaluate the proposed method over several transfer learning benchmark datasets and compare it against state-of-the-art transfer learning methods. In Section 4, we conclude the paper and discuss future work.

2. Methods

2.1. Definition of Symbols

In this section, we give a list of detailed definitions of the symbols used in the following sections.

2.2. Problem Modeling

We assume we have a set of n training data points, denoted as , where is the i-th data point, which is composed of instances, and is the j-th instance of the i-th data point. For a computer vision problem, a data point is an image and an instance is an image patch, while for a natural language processing problem, a data point is a sentence and an instance is the embedding vector of a word.
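
To make the data-point/instance structure concrete, the snippet below builds toy examples of both cases; all shapes (patch size, number of patches, embedding dimension, sentence length) are arbitrary choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Vision case: one data point = an image represented as a bag of
# 64 patches, each an 8x8x3 patch flattened into a vector.
image_data_point = rng.random((64, 8 * 8 * 3))       # instances are patches

# NLP case: one data point = a sentence of 20 words, each word
# represented by a 300-dimensional embedding vector.
sentence_data_point = rng.random((20, 300))          # instances are words

print(image_data_point.shape, sentence_data_point.shape)
```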

2.2.1. Large Margin Class-Specific Convolutional Representations

We consider a classification problem with L classes; the training set can be divided into L class-specific subsets and a set of unlabeled data points whose class labels are not yet known. The training set can be denoted as follows: where is the subset of the l-th class and is the subset of the unlabeled data points.

For the l-th class, we learn a class-specific deep CNN model to represent the data point X, which outputs a vector of m dimensions as the class-specific convolutional representation, and denotes the parameters of the model.

Remark. We choose to learn convolutional representations for the following two reasons:
(1) The CNN model is good at extracting local patterns by utilizing a large number of sliding local filters, and in most of the domain transfer applications discussed in this paper, local patterns play the most important role. For example, in the cross-domain image categorization task, for two images of different domains containing the same object, the CNN model can capture the local region of the object with some local filters while ignoring the context, which may vary across domains. Another example is text-related tasks: long sentences on the same topic may have different linguistic styles in different domains but still contain short phrases, and the CNN model can capture these effectively by using its sliding local filters to extract features from short phrases.
(2) The CNN model, compared to other deep learning models such as the recurrent neural network (RNN), has a more efficient training process. The CNN model has a parallel structure, and the responses of a sliding filter are calculated independently of each other; thus, its computation can be easily parallelized on a GPU. This is different from the RNN model, which has a sequential structure, where the response of a node is calculated based on the responses of the previous nodes, making its computation time longer than that of the CNN model.

Naturally, we hope the class-specific convolutional representations can separate the data points of the l-th class from the other classes as far as possible so that the classification performance can be improved. To this end, we propose to learn discriminative convolutional representations for the data points of the l-th class by maximizing the local margin of each data point in this class. The local margin of a data point in the l-th class is defined by its intraclass neighborhood and its interclass neighborhood. The intraclass neighborhood is the set of κ nearest neighboring data points in the same class, : where the norm distance between their l-th class-specific convolutional representations is used to measure the distance between neighbors. Meanwhile, the interclass neighborhood is the set of κ nearest neighboring data points from different classes: Note that to search for the interclass neighbors of , we use the convolutional representations of its class, the l-th class, even for the data points of the other classes. We further calculate an affinity measure between and a data point from , according to their class-specific convolutional representations and a Gaussian kernel function: where .
Similarly, we also calculate the interclass affinity between and the data points in : The local margin of is then defined as the difference between the weighted intraclass convolutional dissimilarity and the interclass dissimilarity: We propose to maximize the local margin to improve the ability to separate data points of the l-th class from the other classes. To this end, we minimize the following objective function of the margins over the data points of the l-th class to learn the convolutional representation network parameters:
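
The following sketch illustrates how the neighborhood-based local margin just described could be computed for one data point, given the class-l representations of all training points. It is our own illustration: the Gaussian affinity form, the weighting of squared distances, and the sign convention (interclass minus intraclass, so that a larger value means a better-separated point) are assumptions consistent with the description above, not the paper's exact formulas.

```python
import numpy as np

def gaussian_affinity(z_a: np.ndarray, z_b: np.ndarray, sigma: float = 1.0) -> float:
    """Gaussian-kernel affinity between two class-specific representations."""
    return float(np.exp(-np.sum((z_a - z_b) ** 2) / (2.0 * sigma ** 2)))

def local_margin(i: int, Z: np.ndarray, labels: np.ndarray,
                 kappa: int = 5, sigma: float = 1.0) -> float:
    """Local margin of point i under class-specific representations Z (n x m)."""
    dists = np.linalg.norm(Z - Z[i], axis=1)
    dists[i] = np.inf                                  # exclude the point itself
    same = np.where(labels == labels[i])[0]            # candidates for intraclass
    diff = np.where(labels != labels[i])[0]            # candidates for interclass
    intra = same[np.argsort(dists[same])[:kappa]]      # kappa nearest, same class
    inter = diff[np.argsort(dists[diff])[:kappa]]      # kappa nearest, other classes
    intra_dissim = sum(gaussian_affinity(Z[i], Z[j], sigma) * dists[j] ** 2 for j in intra)
    inter_dissim = sum(gaussian_affinity(Z[i], Z[j], sigma) * dists[j] ** 2 for j in inter)
    return inter_dissim - intra_dissim                 # assumed sign convention

# Toy usage: 30 points, 3 classes, 4-dimensional representations.
rng = np.random.default_rng(0)
Z = rng.normal(size=(30, 4))
labels = rng.integers(0, 3, size=30)
print(local_margin(0, Z, labels))
```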

2.2.2. Minimum Mutual Information Domain Adaptation

Since we are considering the problem of domain transfer learning, the training data points come from a source domain and a target domain, and we denote the source domain training set as and ; thus, . We introduce a domain indicator for each data point to indicate which domain it comes from, , where indicates that is a source domain data point, while indicates that it is a target domain data point. Naturally, we hope the classwise convolutional representations of the source and target domains are mapped to a common space with the same distribution. To this end, we impose that the representations of the data points and their domain indicators are independent of each other, so that from a representation we cannot tell which domain it comes from. To measure the mutual dependence between the classwise representation z and the domain indicator π, we propose to use the mutual information between them, .

Remark. According to probability theory and information theory, the mutual information between two variables is a measure of the mutual dependence between them. For two variables x and y, the mutual information of x and y is calculated by the double integral as follows: where is the joint probability function of x and y and is the marginal probability function of x (respectively, y). For discrete variables, the mutual information is calculated by the double sum: According to the mutual information's relation to the Kullback–Leibler divergence, where is the Kullback–Leibler divergence between and and is the conditional probability of x given y. Following equation (11), the mutual information between π and z is defined as follows: To estimate the mutual information over the training set, we rewrite as follows: In the following, we discuss how to estimate the conditional probability of the domain indicator given the convolutional representation, , and the probability of the convolutional representation, .
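
For reference, the standard information-theoretic identities invoked in this remark can be written out as follows; these use generic symbols x and y rather than the paper's own notation and equation numbers.

```latex
% Mutual information for continuous and discrete variables
I(x;y) = \int\!\!\int p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}\,\mathrm{d}x\,\mathrm{d}y,
\qquad
I(x;y) = \sum_{x}\sum_{y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}.

% Relation to the Kullback--Leibler divergence via the conditional p(x|y)
I(x;y) = \sum_{y} p(y)\, D_{\mathrm{KL}}\!\big(p(x\mid y)\,\big\|\,p(x)\big)
       = \sum_{y} p(y) \sum_{x} p(x\mid y)\,\log\frac{p(x\mid y)}{p(x)}.
```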

2.2.3. Estimation of

To estimate the probability of π given a data point, we propose to calculate the density of π over the neighborhood of . is the set of k-nearest neighbors, and the probability of π over is calculated as the empirical distribution, where the indicator equals 1 if its argument is true and 0 otherwise. According to equation (15), is the weighted summation of over , where the weights are hard weights . We relax the calculation of the weights to soft weights according to a Gibbs distribution as follows:

The weights satisfy the constraints of nonnegativity and summation to one.
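
A minimal sketch of this neighborhood-based estimator is given below, assuming the Gibbs-style soft weights are computed from squared distances with a temperature parameter; the exact weight form and the symbols (k, temperature) are assumptions for illustration.

```python
import numpy as np

def domain_posterior(i: int, Z: np.ndarray, domains: np.ndarray,
                     k: int = 10, temperature: float = 1.0) -> dict:
    """Estimate P(pi | z_i) over the k-nearest neighborhood of point i.

    Z:       (n, m) convolutional representations of the training points.
    domains: (n,) domain indicators, e.g. +1 for source and -1 for target.
    """
    dists = np.linalg.norm(Z - Z[i], axis=1)
    dists[i] = np.inf                              # exclude the point itself
    neighbors = np.argsort(dists)[:k]
    # Gibbs-distributed soft weights over the neighborhood.
    logits = -dists[neighbors] ** 2 / temperature
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                       # nonnegative, sums to one
    return {int(d): float(weights[domains[neighbors] == d].sum())
            for d in np.unique(domains)}

# Toy usage.
rng = np.random.default_rng(0)
Z = rng.normal(size=(40, 8))
domains = rng.choice([1, -1], size=40)
print(domain_posterior(0, Z, domains))
```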

2.2.4. Estimation of

We assume the convolutional representations are uniformly distributed; thus, we use a simple empirical distribution function to calculate the probability of as follows:

Substituting equations (17) and (15) into (13), we rewrite the mutual information between the variables z and π as follows:

To simplify the equations, we introduce the following variables: so that

We rewrite equation (18) with , , and as follows:

To learn a cross-domain representation that maps the data of both domains to a common space, we reduce the dependence between the domain indicator and the convolutional representation variables as much as possible. Since the mutual information measures this dependence, we minimize as follows:

In this way, we hope that the learned representations are as independent of the domains as possible so that they can be generalized to both domains.
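
Putting the two estimators together, a plug-in estimate of the mutual information can be computed as sketched below, assuming a uniform 1/n probability for each representation and using the per-point conditional probabilities (for instance, from the domain_posterior sketch above); this is an illustration consistent with the description, not the paper's exact derivation.

```python
import numpy as np

def mutual_information_estimate(cond_probs: np.ndarray, eps: float = 1e-12) -> float:
    """Plug-in estimate of I(z; pi) from per-point conditional probabilities.

    cond_probs: (n, 2) array; row i holds P(pi = source | z_i) and
                P(pi = target | z_i). Each P(z_i) is assumed to be 1/n.
    """
    p_pi = cond_probs.mean(axis=0)                          # marginal P(pi)
    ratio = (cond_probs + eps) / (p_pi + eps)               # P(pi|z) / P(pi)
    return float(np.mean(np.sum(cond_probs * np.log(ratio), axis=1)))

# Toy usage: nearly domain-independent conditionals give a small estimate.
cond = np.array([[0.5, 0.5], [0.55, 0.45], [0.45, 0.55], [0.5, 0.5]])
print(mutual_information_estimate(cond))
```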

To construct the learning framework for the domain adaptation problem based on the classwise deep CNN representation model, we combine the objectives of equations (8) and (22) into the following minimization problem: where the term is used to control the complexity of the model to prevent overfitting, and and are the tradeoff parameters. In the objective, the first term corresponds to the large margin, while the second and third terms correspond to the entropies of the location distribution of the convolutional representations over the neighborhoods specified by the source and target domains. The fourth term corresponds to the entropy of the overall location distribution of the representations.

2.3. Optimization

It is difficult to solve the problem of equation (23) because the classwise representations are the outputs of a deep CNN function, , while they also define the neighborhoods and affinities. To solve the problem of equation (23), we treat the representations as slack variables and introduce the following optimization problem:

To solve this problem, we use the alternating direction method of multipliers (ADMM). Following ADMM, we obtain the following optimization problem: where is a dual variable for the constraint and ρ is its penalty parameter. We solve this problem by alternately updating the variables in an iterative algorithm.

2.3.1. Updating of

The updating of is conducted by solving the following minimization problem: and we solve it by a gradient descent method: where ρ is a descent step parameter and is the gradient function with respect to ,

2.3.2. Updating of

Updating of is conducted by solving the following minimization problem:

We also use the backpropagation algorithm to solve this problem, based on the chain rule:

2.3.3. Updating of

The dual variable is updated by gradient ascent:

2.4. Overall Learning Algorithm of MMITR

In this section, we give the overall iterative learning algorithm of the proposed minimum mutual information transfer representation (MMITR) method. The algorithm has an updating strategy similar to the expectation-maximization (EM) algorithm. In each iteration, we first fix the domain transfer representations to update the inter- and intraclass affinity measures in an E-step, and then fix the inter- and intraclass affinity measures to update the CNN parameters and the representations in an M-step. The iterations stop when a maximum number of iterations is reached or the objective value falls below a threshold. The overall algorithm is described in Algorithm 1.

Input: Training set of L classes and unlabeled data points ;
Input: Domain indicators of training points ;
Input: Tradeoff parameters and ;
Input: Maximum number of iterations, η;
Input: Objective value threshold, ε.
Initialize iteration indicator .
Initialize model parameters and objective value .
while or objective value do
E-step: Update the inter- and intraclass affinities for each data point, according to equations (5) and (6).
M-step: Iterating the ADMM updating steps.
for do
  Update the domain transfer representations by iterating the gradient descent steps of equation (27).
  Update the CNN model parameters by iterating the backpropagation steps of equation (30).
  Update the dual variables by iterating the gradient ascent steps of equation (31).
endfor
.
endwhile
Output: W and .
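
A structural sketch of this EM-style loop with ADMM inner updates is given below. It is only a skeleton: a linear map stands in for the classwise CNN, a crude surrogate gradient stands in for the margin and mutual-information terms, and all names, step sizes, and iteration counts are illustrative assumptions rather than the algorithm's actual components.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 200, 32, 8
X = rng.normal(size=(n, d))                   # toy inputs
domains = rng.choice([1, -1], size=n)         # domain indicators
W = rng.normal(scale=0.1, size=(d, m))        # "CNN" parameters (a linear map here)
Z = X @ W                                     # slack representation variables
U = np.zeros_like(Z)                          # scaled ADMM dual variables
rho, lr, tau1 = 1.0, 1e-2, 1.0                # penalty, step size, MI tradeoff

def surrogate_grad(Z: np.ndarray, domains: np.ndarray, tau1: float) -> np.ndarray:
    """Stand-in for the gradient of the margin + mutual-information terms:
    here it simply pulls each domain's mean toward the overall mean."""
    grad = np.zeros_like(Z)
    for dmn in (1, -1):
        mask = domains == dmn
        grad[mask] = tau1 * (Z[mask].mean(axis=0) - Z.mean(axis=0)) / mask.sum()
    return grad

for epoch in range(20):                       # outer EM-style iterations
    # E-step: recompute neighborhoods and affinities from the current Z here.
    for _ in range(5):                        # inner ADMM iterations (M-step)
        # 1) Update the representations Z by gradient descent on the
        #    augmented Lagrangian.
        Z -= lr * (surrogate_grad(Z, domains, tau1) + rho * (Z - X @ W + U))
        # 2) Update the model parameters W; a least-squares fit plays the
        #    role of back-propagation through the CNN in this sketch.
        W = np.linalg.lstsq(X, Z + U, rcond=None)[0]
        # 3) Update the scaled dual variables by gradient ascent.
        U += Z - X @ W
```
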
2.5. Prediction of a New Data Point

When we have a new data point X to classify, we calculate its classwise representation and the corresponding margin for each class: where the intra- and interclass neighborhoods and affinities are calculated according to the classwise representations. The new data point is assigned to the class that gives the maximum margin:
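
The sketch below illustrates this prediction rule with toy linear maps standing in for the classwise representation models and a simplified local-margin score; the models, the margin form, and its sign convention are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, L = 16, 4, 3
# Toy classwise "representation models": one linear map per class.
models = {l: (lambda W: (lambda x: x @ W))(rng.normal(size=(d, m))) for l in range(L)}
train_X = rng.normal(size=(60, d))
train_y = rng.integers(0, L, size=60)

def local_margin_of(z: np.ndarray, Z: np.ndarray, y: np.ndarray,
                    cls: int, kappa: int = 5) -> float:
    """Simplified local margin: interclass minus intraclass neighbor distances."""
    dists = np.linalg.norm(Z - z, axis=1)
    intra = np.sort(dists[y == cls])[:kappa]
    inter = np.sort(dists[y != cls])[:kappa]
    return float(inter.sum() - intra.sum())

def predict(x: np.ndarray) -> int:
    margins = {}
    for l, f in models.items():
        Z_l = f(train_X)                   # class-l representations of training data
        margins[l] = local_margin_of(f(x), Z_l, train_y, cls=l)
    return max(margins, key=margins.get)   # class with the maximum margin

print(predict(rng.normal(size=d)))
```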

3. Experiments

3.1. Datasets

In our experiments, we use the following datasets as benchmarks:
Office-31: this dataset has 4,652 images of 31 classes. The images come from three different domains: Amazon (images downloaded from http://www.amazon.com), Webcam (photos taken by a web camera), and DSLR (photos taken by a digital SLR camera).
ImageCLEF-DA: this dataset is composed of images of 12 classes from four domains. Each domain is a distinct database: Caltech-256, ImageNet ILSVRC 2012, Pascal VOC 2012, and Bing.
Email spam: this dataset contains spam and nonspam email texts. The data are collected from three users, each with 2,500 emails, and each user is treated as a domain.
Extended Cohn-Kanade (CK+): this dataset is an image set for facial expression recognition. It has images of 123 subjects, and each subject is treated as a domain. The dataset has 593 videos in total, each with about 20 frames. Each face image belongs to one of 7 expression classes.
Amazon: this dataset is a text classification dataset. The texts come from three different domains, each consisting of reviews of one product category: books, DVDs, or music. Each domain has 2,000 positive review texts and 2,000 negative review texts.

3.2. Experimental Setting

In this experiment, we use each domain of a dataset as the target domain in turn, with the remaining domains as source domains. For each target domain, we use the leave-one-out protocol to split it into a training set and a test set: each data point of the target domain is used as a test data point in turn, and the remaining data points are combined to form the training set. The training set of the target domain is randomly split into an unlabeled set and a labeled set of equal size. The data points of the source domains are always treated as labeled in our setting. Our algorithm is performed over the training set to learn the parameters of the classwise representation model and the domain-independent representations, which are then used to classify the test data points. The classification accuracy is used to evaluate the performance of the algorithm.
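
The splitting part of this protocol can be sketched as follows; the sizes and index handling are illustrative, and the training/classification step is left as a placeholder comment.

```python
import numpy as np

rng = np.random.default_rng(0)
n_target = 50                                 # toy target-domain size
target_idx = np.arange(n_target)

for test_i in target_idx:                     # leave-one-out over the target domain
    train_idx = np.delete(target_idx, test_i)
    rng.shuffle(train_idx)
    half = len(train_idx) // 2
    labeled_idx = train_idx[:half]            # labeled target training points
    unlabeled_idx = train_idx[half:]          # unlabeled target training points
    # ...train MMITR on all source points plus the labeled/unlabeled target
    # split, then classify the held-out point test_i and record the outcome...
```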

3.3. Results

In our experiment, we first study the properties of the algorithm experimentally, including its sensitivity to the tradeoff parameters and its convergence property to iteration numbers. Then, we compare its performance to state-of-the-art domain transfer learning algorithms.

3.3.1. Algorithm Property Evaluation

(1) Sensitivity to Tradeoff Parameters. There are two tradeoff parameters in our algorithm: and . They are the weights of the mutual information term and the complexity reduction term in our objective. We plot the accuracy of our algorithm for different values of , as shown in Figure 2. From this figure, we observe that the accuracy improves in most cases when the value of increases. Since is the weight of the mutual information term, which measures how dependent the representation is on the domain, this indicates that a more domain-independent representation helps the classification in the target domain. In fact, the more independent the representation is of the domain, the better the data of different domains are merged, so the source domain can benefit the learning problem in the target domain more. This phenomenon is even more obvious on the CK+ dataset: when grows from 1 to 10, the accuracy is boosted significantly. This is strong evidence of how minimizing the mutual information improves transfer learning.

The accuracy curves of classification with different values of are shown in Figure 3. From this figure, we can see that the proposed algorithm is stable with respect to changes in . Since the algorithm is not sensitive to the change of , tuning this parameter is easy for a specific dataset. The only exception is the case where varies between 1 and 10, in which the accuracy changes dramatically.

(2) Convergence Analysis. Since our algorithm is iterative, it is critical to know when to stop the iterations. We study the convergence of the algorithm by plotting the accuracy over different datasets with varying numbers of iterations in Figure 4. According to the curves in the figure, on most datasets, the accuracy improves as the number of iterations grows and then becomes stable after about 100 iterations. For the email spam dataset, the algorithm converges at 50 iterations.

Remark. To solve the minimization problem in our algorithm, we employ the ADMM algorithm. To verify that the ADMM algorithm solves the minimization problem effectively, we plot the objective values of the learning problem with increasing numbers of iterations in Figure 5. As we can see from the curves, the objective value decreases steadily as the number of iterations increases until it converges, after which the change in the objective value becomes small. This is strong evidence that the ADMM algorithm solves the optimization problem effectively (Table 1).

3.3.2. Comparison to State of the Arts

We compare our algorithm, MMITR, to several state-of-the-art transfer learning algorithms, including the Deep Adaptation Network (DAN) [4], Selective Transfer Machine (STM) [1], Semisupervised Kernel Matching Domain Adaptation (SSKMDA) [6], and Domain Transfer Multiple Kernel Learning (DTMKL) [2]. In Table 2, we have provided a detailed list of algorithms compared in the experiment, regarding the aspect of data representation components and domain matching criteria.

The comparison of accuracy results is given in Figure 6. In this figure, we can observe that the proposed method outperforms the other methods in four experiments out of five. In the experiments over three datasets (Office-31, ImageCLEF-DA, and Amazon), our algorithm outperforms the second best method, DAN, by a large margin. On the CK+ dataset, DAN slightly outperforms our method. Both DAN and our method MMITR are based on deep learning models, but our method tries to learn domain-independent deep representations, while DAN tries to learn a deep model that represents the data points so that the mean representations of the source and target domains are similar to each other. Since, according to the results, MMITR outperforms DAN in most cases, we conclude that domain-independent deep representation is more suitable for domain transfer learning than domain-mean-matched representation. The other methods are also based on mean matching of domain transfer representations but use shallow models instead of deep models; thus, they are not able to explore hierarchical deep features. This again verifies the effectiveness of the deep model.

Remark. The settings of the methods whose results are reported in Figure 6 are described in detail as follows. The DAN algorithm has two hyperparameters, the MMD penalty λ and the entropy penalty γ, and we set their values to 1 and 0.1, respectively. The STM algorithm has two hyperparameters: C for the tradeoff between the maximal margin and the training loss and λ for the tradeoff between the SVM empirical risk and the domain mismatch loss; in this experiment, we set both values to 1. The SSKMDA algorithm has five tradeoff parameters between the model components: , and their values in our experiments are 10, 2, 0.1, 0.1, and 1, respectively. The DTMKL algorithm has only one hyperparameter, the regularization parameter C, and we set it to 0.5 in the experiments. Our algorithm MMITR has two tradeoff parameters, and ; for each benchmark dataset, we report the best result among those obtained with different values of and .

4. Conclusions and Future Works

In this paper, we proposed a novel framework for transfer learning. Unlike traditional transfer learning, which tries to match the representations of the source domain and the target domain, we proposed to learn domain-independent representations. We measure the dependence of the learned deep representations on the domain by mutual information and learn the domain-independent deep representations by minimizing this mutual information. We also proposed a practical estimation method for the mutual information between the domain and the deep representations. A classwise deep representation neural network is trained under this framework and used to classify new data points. Experiments over benchmark datasets for transfer learning verify the effectiveness of the proposed method.

Remark. The new concept proposed in this paper is a novel domain transfer learning framework which minimizes the mutual information between the domain transfer representations and the domain indicators so that the gaps among domains can be effectively bridged and a common representation space is learned. The new method developed in this paper is a novel iterative learning algorithm that learns the domain transfer representations based on CNN models.

Data Availability

All the datasets used in this paper are publicly accessible.

Conflicts of Interest

The authors declare that they have no conflicts of interest.