Generalization Bounds Derived IPM-Based Regularization for Domain Adaptation
Domain adaptation has received much attention as a major form of transfer learning. One issue that must be considered in domain adaptation is the gap between the source domain and the target domain. To improve the generalization ability of domain adaptation methods, we propose a framework for domain adaptation combining source and target data, with a new regularizer that takes generalization bounds into account. This regularization term uses the integral probability metric (IPM) as the distance between the source domain and the target domain and can thus bound the testing error of an existing predictor. Since the computation of the IPM involves only the two distributions, this regularization term is independent of the specific classifier. With popular learning models, the empirical risk minimization is expressed as a general convex optimization problem and can thus be solved effectively by existing tools. Empirical studies on synthetic data for regression and real-world data for classification show the effectiveness of this method.
1. Introduction
Generalization ability is a main concern of statistical learning theory. How to improve prediction accuracy under the empirical risk minimization (ERM) principle is of practical importance, since ERM-based learning is widely used. As an important technique for improving generalization ability and avoiding overfitting, regularization plays a crucial role in maintaining the trade-off between the empirical loss and the expected risk. Different regularizers may yield different performance, and the choice depends on the specific purpose.
Traditional supervised learning needs many labeled data to train a precise model. It is well known that annotating large amounts of unlabeled data is both labor- and time-consuming. Another underlying assumption is that training data and testing data, although provided separately, are drawn from the same distribution, so that a model trained on the former can predict labels of the latter. In practice, however, the available labeled data often come from different sources and differ from the data we need to predict. In other words, labeled data from the target domain are not always accessible or sufficient. As a consequence, a predictor for the target data cannot be obtained by training directly on the provided labeled data.
As an efficient way to exploit a small number of labeled data, or even unlabeled data from other sources, domain adaptation has received increasing attention in recent years [2–4]. Patterns from the source domain and the target domain are used to obtain better predictive ability on target data. Learning from multiple source domains and combining source and target domains are popular approaches proposed in recent years. Along with successful applications of domain adaptation, several works have focused on the learning ability of this paradigm. In particular, the generalization bounds of domain adaptation have been studied with the integral probability metric (IPM) chosen to measure the distance between the source domain and the target domain. A natural question is how to combine these theoretical results with practical algorithm design, thus creating more efficient learning algorithms.
In this paper, we propose a framework for domain adaptation combining source and target data, taking the IPM as the regularization term. Since the IPM is defined as the upper bound of the gap between two distributions (source domain and target domain), the regularization term is independent of the specific predictor. In other words, many popular learning models can be used under this framework. In many cases, the empirical risk minimization problem can be solved efficiently as a convex optimization problem in reasonable time.
The remainder of this paper is organized as follows. Section 2 reviews related work on the theoretical analysis of domain adaptation problems and on a regularized domain adaptation framework. Section 3 introduces the problem set-up and the derived IPM-based generalization bounds. We propose the framework in Section 4 and report the experimental results for regression and classification in Section 5. Section 6 concludes this paper.
2. Related Works
There have been many works focused on the theoretical analysis of domain adaptation. Generally speaking, the generalization performance is measured by the size of the training set, the complexity of the function class, and several constants. For domain adaptation specifically, one also needs to measure the divergence between different distributions. For measuring the complexity of the function class, the VC-dimension is widely used in traditional learning models as well as in domain adaptation [4, 6, 9]. Besides the VC-dimension, the covering number and Rademacher complexity are also used to measure the function class in generalization bounds of domain adaptation [5, 7]. For measuring the difference between distributions, the $\mathcal{H}$-divergence is used in [4, 6]; the same concept is elsewhere called the $\mathcal{A}$-distance. It is defined as an upper bound on the difference between two probability distributions, which is straightforward for classification. Other works introduce different quantities for more general tasks including regression, and some further take the labeling function into consideration.
One significant purpose of theoretical analysis is to provide guidance for designing new algorithms. Most of the above works, however, derive generalization bounds of domain adaptation in order to establish important properties of the learning process instead, such as convergence rate, effectiveness, and correctness.
In terms of regularized domain adaptation, a framework called the domain adaptation machine (DAM) [11, 12] describes a data-dependent regularizer based on a smoothness assumption and on the relevance between the source domain and the target domain. The framework is similar to our method in some respects, but its definition and optimization are different. DAM mainly addresses domain adaptation from multiple sources, whereas we consider domain adaptation combining source (including multiple sources) and target data, which has a different empirical loss as well as a different regularizer. Nevertheless, the regularizer in DAM is closely connected with ours; the details are discussed in Section 4.
3. Domain Adaptation
3.1. Problem Description
In domain adaptation, the source domain and the target domain each have their own distribution over the input space. Traditional supervised learning aims to learn a function for labeling unseen samples in the target domain. In the domain adaptation set-up, this function is hard to estimate directly because labeled target samples are insufficient. With considerable amounts of labeled source samples, one instead minimizes the empirical risk of a loss function with respect to a parameter vector, where the empirical risk is the sample estimate of the expectation of the loss taken with respect to the source distribution. In order to utilize more information about the target domain, the available target samples should also be used. Domain adaptation combining source and target data is therefore defined as minimizing a weighted combination of the source and target empirical risks, where the weight controls the trade-off between learning from source data and learning from target data.
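The weighted combination of source and target empirical risks can be illustrated with a minimal sketch. It is not the authors' implementation: the name `tau` for the trade-off weight, its range [0, 1), the scalar linear model, and the toy samples are all assumptions made for illustration.

```python
from typing import Callable, Sequence, Tuple

Sample = Tuple[float, float]  # (input x, target y); scalar inputs keep the sketch short


def empirical_risk(loss: Callable[[float, float, float], float],
                   w: float, samples: Sequence[Sample]) -> float:
    """Average loss of the model with parameter w over the given samples."""
    return sum(loss(x, y, w) for x, y in samples) / len(samples)


def combined_risk(loss: Callable[[float, float, float], float], w: float,
                  source: Sequence[Sample], target: Sequence[Sample],
                  tau: float) -> float:
    """Convex combination: tau * target risk + (1 - tau) * source risk."""
    if not 0.0 <= tau < 1.0:
        raise ValueError("tau is assumed to lie in [0, 1)")
    return (tau * empirical_risk(loss, w, target)
            + (1.0 - tau) * empirical_risk(loss, w, source))


def squared_loss(x: float, y: float, w: float) -> float:
    """Squared error of the scalar linear model w * x."""
    return (w * x - y) ** 2


source = [(1.0, 2.0), (2.0, 4.0)]  # toy source samples, consistent with y = 2x
target = [(1.0, 3.0)]              # a single toy target sample, consistent with y = 3x

risk_source_only = combined_risk(squared_loss, 2.0, source, target, tau=0.0)
risk_mixed = combined_risk(squared_loss, 2.0, source, target, tau=0.5)
```

With `tau = 0.0` only the source samples contribute, so the model that fits the source exactly has zero risk; raising `tau` makes the misfit on the target sample visible in the objective.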
3.2. Integral Probability Metric
In domain adaptation, it is important to find a quantity that measures the difference between the distributions of the source and the target domains. In this paper, we use the integral probability metric (IPM) for this purpose. Given a function class $\mathcal{F}$, the distance between the source domain $S$ and the target domain $T$ is defined as $$D_{\mathcal{F}}(S, T) = \sup_{f \in \mathcal{F}} \left| \mathrm{E}^{(S)} f - \mathrm{E}^{(T)} f \right|,$$ where $\mathrm{E}^{(S)}$ and $\mathrm{E}^{(T)}$ denote expectations under the source and target distributions. If the source domain and the target domain have the same probability distribution, the quantity is equal to zero.
Given samples drawn from the source domain and samples drawn from the target domain, the two expectations can be roughly estimated by sample averages, and the IPM can thus be approximated from the given data. However, the target samples are too few to learn a predictor on their own, that is, there are far fewer target samples than source samples. Domain adaptation therefore minimizes a convex combination of the source and the target empirical risks. When the weight on the target risk is zero, this reduces to the learning process of basic domain adaptation with a single source.
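The sample-based approximation of the IPM can be sketched as follows. This is an illustrative simplification rather than the estimator used in the paper: the supremum is taken over a small finite function class chosen here by hand, and the sample values are made up.

```python
import math
from typing import Callable, Iterable, Sequence


def empirical_ipm(functions: Iterable[Callable[[float], float]],
                  source: Sequence[float], target: Sequence[float]) -> float:
    """Largest absolute gap between source and target sample means over the class."""
    def mean(f: Callable[[float], float], xs: Sequence[float]) -> float:
        return sum(f(x) for x in xs) / len(xs)

    return max(abs(mean(f, source) - mean(f, target)) for f in functions)


# Hypothetical finite function class standing in for the class in the definition.
function_class = [lambda x: x, lambda x: x * x, math.tanh]

same = empirical_ipm(function_class, [0.0, 1.0, 2.0], [0.0, 1.0, 2.0])
shifted = empirical_ipm(function_class, [0.0, 1.0, 2.0], [1.0, 2.0, 3.0])
```

Identical samples give a value of zero, in line with the property that the IPM vanishes when the two distributions coincide, while the shifted sample yields a strictly positive value.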
3.3. Generalization Bounds
The generalization bounds of a learning process need to consider three essential aspects: complexity measure of function class, Hoeffding-type deviation inequality, and symmetrization inequality.
Different from the classical VC-dimension form, Zhang et al. chose the uniform entropy number to measure the complexity, which is derived from the concept of the covering number. For a function class equipped with a metric, the covering number at a given radius is the minimum size of a cover of that radius. The covering number is not suitable for domain adaptation. As a variant of the covering number, the uniform entropy number is obtained by fixing the metric to the empirical one induced by a sample set and taking the supremum of the logarithm of the covering number over all such sample sets. The uniform entropy number is distribution-free and can be chosen as the complexity measure of the function class to derive the generalization bounds for domain adaptation.
Hoeffding-type deviation inequality for domain adaptation is an extension of the classical Hoeffding-type deviation inequality which allows the random variables to take values from different domains. It is assumed that is a function class consisting of bounded functions with the range . A function is defined as follows:For any and any ,where the expectation is taken on both the source domain and the target domain .
Symmetrization inequality for domain adaptation has a discrepancy term compared to the classical symmetrization result under the assumption of the same distribution. For any , the probability of the eventcan be bounded by using the probability of the eventwhere .
Based on the uniform entropy number, using a specific Hoeffding-type deviation inequality and symmetrization inequality, the generalization bounds of domain adaptation combining source and target data are derived as follows.
Assume that is a function class consisting of the bounded functions with the range . For any and given an arbitrary , we have, for any , with probability of at least ,The derived bound contains a term of discrepancy quantity .
4. IPM-Based Regularization Framework
From formula (10), we can see that the generalization bounds of domain adaptation consist of two parts: the integral probability metric (IPM) and an extension of the covering number (referred to as the uniform entropy number). Since the IPM is relatively easy to compute when source data and target data are available, it is straightforward to use this term as a regularizer to reduce the generalization error. Besides, it is also intuitive to make full use of target information to construct predictors. For a single source, given data with the corresponding labels (or target values for regression), model parameters, and a loss defined on a single sample, the general objective function for supervised learning can be written as a regularized risk minimization problem: the sum of the per-sample losses plus a regularizer weighted by a balancing parameter.
Based on the definition of the IPM (3), the empirical risk (4), and the learning principle (11), we formally propose the framework of domain adaptation combining the source and the target data by replacing the regularizer in (11) with the IPM between the source and the target domains, which gives objective (12).
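For orientation, the two objectives discussed here can be written out in a generic form. The notation is assumed for illustration only ($\mathbf{w}$ for the model parameters, $\ell$ for the per-sample loss, $\Omega$ for a generic regularizer, $\lambda$ for the balancing parameter, $\tau$ for the source/target trade-off, $\hat{R}_S$ and $\hat{R}_T$ for the source and target empirical risks, $D_{\mathcal{F}}(S, T)$ for the IPM); the symbols in the original formulas may differ.

```latex
% Generic regularized risk minimization, in the role of (11)
\min_{\mathbf{w}} \; \sum_{i=1}^{N} \ell(\mathbf{x}_i, y_i; \mathbf{w}) + \lambda\, \Omega(\mathbf{w})

% Proposed form, in the role of (12): the regularizer is the IPM between the domains
\min_{\mathbf{w}} \; \tau\, \hat{R}_{T}(\mathbf{w}) + (1 - \tau)\, \hat{R}_{S}(\mathbf{w}) + \lambda\, D_{\mathcal{F}}(S, T)
```

The only structural change between the two lines is the regularizer: a term that measures model complexity is replaced by a term that measures the gap between the domains.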
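The IPM can be empirically estimated by various popular distance metrics by appropriately choosing the function class. Specifically, in a reproducing kernel Hilbert space (RKHS), the IPM is called the kernel distance or maximum mean discrepancy (MMD). The empirical estimator of the MMD is straightforward: it is the RKHS norm of the difference between the mean feature embeddings of the source samples and of the target samples, where $\phi$ is the feature space mapping function and the kernel is defined as the inner product of two feature maps, $k(x, x') = \langle \phi(x), \phi(x') \rangle$.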
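A minimal sketch of the squared empirical MMD follows, using the standard expansion of the squared norm into three averaged kernel terms (the biased estimator). The function names and the toy vectors are hypothetical; with the linear kernel the result equals the squared Euclidean distance between the two sample means.

```python
from typing import Callable, Sequence

Vector = Sequence[float]
Kernel = Callable[[Vector, Vector], float]


def linear_kernel(a: Vector, b: Vector) -> float:
    """Inner product of the raw feature vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def mean_kernel(kernel: Kernel, xs: Sequence[Vector], ys: Sequence[Vector]) -> float:
    """Average kernel value over all pairs (x, y) with x in xs and y in ys."""
    return sum(kernel(x, y) for x in xs for y in ys) / (len(xs) * len(ys))


def mmd_squared(kernel: Kernel, source: Sequence[Vector],
                target: Sequence[Vector]) -> float:
    """Biased estimate: mean k(s, s') - 2 * mean k(s, t) + mean k(t, t')."""
    return (mean_kernel(kernel, source, source)
            - 2.0 * mean_kernel(kernel, source, target)
            + mean_kernel(kernel, target, target))


source = [[0.0, 0.0], [2.0, 0.0]]  # toy source sample with mean (1, 0)
target = [[1.0, 1.0], [1.0, 3.0]]  # toy target sample with mean (1, 2)
gap = mmd_squared(linear_kernel, source, target)
```

Because the estimate depends only on the two samples and the kernel, it can be computed once before training, which is what makes it independent of the model parameters.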
The DAM framework constructs a domain-dependent regularizer for domain adaptation from multiple sources. For each source domain, the regularizer compares the decision values of the target classifier with those of the corresponding source classifier on the unlabeled instances in the target domain, and the per-source terms are combined with one coefficient for each source.
From the definition we can see that the regularizer we use in (12) is much simpler than that in DAM. Moreover, the objective function in DAM consists of three parts; the other two are the regularizer that controls the complexity of the target classifier and the loss of the target classifier. In contrast, the objective function in (12) considers a combination of the losses over the source domain and the target domain.
The proposed framework is also suitable for domain adaptation combining multiple sources, where the source risk and the regularization term in (12) are each defined as a linear combination of several terms, one per source; see (15) and (16). The generalization bound of domain adaptation from multiple sources has a form similar to (10), where the first term on the right side is a linear combination of several IPMs instead of one; see (16).
5. Experiments
We first carry out experiments on both simple regression and classification problems to verify the effectiveness of (12). For ease of optimization, we use least squares as the loss function. This is natural in regression, since the target value is continuous; for binary classification, a few articles have discussed this loss. Yang and Chute employed it in text classification, and Rifkin et al. pointed out the rationality of the least squares loss compared with SVM. Since the loss is quadratic while the IPM is expressed as an absolute value under this setting, it is necessary to convert the regularizer into the squared form of the original value to balance these two terms, and it can be approximated by the gap between the losses on the target domain and the source domain. All these tricks make the whole objective function, consisting of both the loss function and the regularizer, convex and much easier to optimize. We use the limited-memory BFGS implementation provided by the package yagtom (https://code.google.com/p/yagtom/) in the experiments.
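The objective described in this paragraph can be sketched as a plain function evaluation. This is one reading of the text, not the authors' code: the regularizer is taken to be the squared gap between the mean least squares losses on the target and source samples, and the names `tau` (trade-off weight) and `lam` (regularization weight) are assumptions. The sketch only evaluates the objective; the minimization itself, done with limited-memory BFGS in the paper, is not reproduced here.

```python
from typing import Sequence, Tuple

Vector = Sequence[float]
Sample = Tuple[Vector, float]  # (feature vector, target value)


def mean_squared_loss(w: Vector, samples: Sequence[Sample]) -> float:
    """Mean least squares loss of the linear model <w, x> over the samples."""
    total = 0.0
    for x, y in samples:
        prediction = sum(wi * xi for wi, xi in zip(w, x))
        total += (prediction - y) ** 2
    return total / len(samples)


def regularized_objective(w: Vector, source: Sequence[Sample],
                          target: Sequence[Sample],
                          tau: float, lam: float) -> float:
    """Combined risk plus lam times the squared loss gap between the domains."""
    loss_source = mean_squared_loss(w, source)
    loss_target = mean_squared_loss(w, target)
    combined = tau * loss_target + (1.0 - tau) * loss_source
    return combined + lam * (loss_target - loss_source) ** 2


source = [([1.0], 2.0)]  # toy source sample, fitted exactly by w = [2.0]
target = [([1.0], 3.0)]  # toy target sample, missed by 1 with w = [2.0]

without_regularizer = regularized_objective([2.0], source, target, tau=0.5, lam=0.0)
with_regularizer = regularized_objective([2.0], source, target, tau=0.5, lam=0.1)
```

A parameter vector that fits one domain much better than the other is penalized by the additional term, which pushes the minimizer toward models that perform comparably on both domains.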
In the last part of the experiments, we apply the least squares support vector machine (LS-SVM) as the classifier, whose loss function is expressed through a kernel function. A regularizer is commonly used with LS-SVM, with a parameter that controls the balance.
We perform numeric experiments on synthetic data for regression test and only consider single source. For target domain, we assume from a Gaussian distribution and the noise vector with ; let model parameters vector of ; then the target values are generated by
The derived will be used in training and cotraining with data from source domain, and will be used as the test data. Similarly, the sample set will be used as source domain and the generating rule iswhere , , and .
With the root mean squared error (RMSE) of the fit as the criterion, we conducted the following four settings in the experiments:
(i) setting 1: training on a small part of the target domain and testing on the target domain;
(ii) setting 2: training on the source domain and testing on the target domain;
(iii) setting 3: training on the source domain combined with a small part of the target domain and testing on the target domain;
(iv) setting 4: training on the source domain combined with a small part of the target domain, with the regularizer, and testing on the target domain.
The regularization parameter is searched over a range of values in setting 4, and the trade-off weight in settings 3 and 4 is chosen following similar numerical experiments that evaluate asymptotic convergence. Each problem is run for 10 rounds, and the average RMSE is recorded as the result. All the results are shown in Table 1.
We can see that, in all cases, the RMSE in setting 4 is the smallest. This indicates that domain adaptation with the IPM regularizer obtains better performance than domain adaptation without it.
When adopting the square loss function in binary classification, we require each sample's label to take one of two values of opposite sign. A prediction is counted as right when the output of the predictor agrees in sign with the label.
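The correctness criterion can be sketched as follows. The label set {-1, +1} and the rule that a prediction is right when the product of label and real-valued output is positive are assumptions, since the symbols are not recoverable from the text; the scores and labels are made up.

```python
from typing import Sequence


def accuracy(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Fraction of samples whose real-valued score has the same sign as the label."""
    correct = sum(1 for score, label in zip(scores, labels) if score * label > 0)
    return correct / len(labels)


scores = [0.8, -0.3, 0.1, -1.2]  # hypothetical outputs of a least squares predictor
labels = [1, 1, 1, -1]           # labels encoded as -1 / +1 (assumed encoding)
result = accuracy(scores, labels)
```

A score of exactly zero is counted as wrong under this rule, which is a design choice of the sketch rather than something stated in the text.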
The binary classification tests are carried out on the text datasets email spam (available at http://www.ecmlpkdd2006.org/challenge.html) and parts of the 20 newsgroups datasets (http://vc.sce.ntu.edu.sg/transfer_learning_domain_adaptation/). The email spam dataset contains a set of 4000 publicly available labeled emails, which is used here as target domain data, and three other sets, each of which has 2500 emails annotated by different users and is used as source domain data. In these four datasets, samples are labeled as nonspam or spam emails. The 20 newsgroups datasets recollected by Duan et al. contain three groups, each with a target set and three sources. Details of the datasets used in classification are shown in Table 2.
We therefore have 12 source-target pairs in total for the experiments. In each pair, we randomly choose samples from the target domain to participate in domain adaptation, and the classification accuracy on the rest of the target set is used as the evaluation criterion. The parameters are picked in the same way as in the regression experiment, and the result for each pair is averaged over 10 runs. The comparison of classification accuracy is listed in Table 3.
As we can see, the domain adaptation with the IPM regularizer can obtain better performance than without it and is even better than just training on small target domain samples in most cases.
5.3. Classification with LS-SVM
In order to improve the classification ability on real datasets, we adopt LS-SVM with a kernel as the predictor. The square of the MMD (19) is easily obtained by expanding the original definition. In the experiments we use the linear kernel for convenience in computing the MMD (13), that is, $k(x, x') = \langle x, x' \rangle$. Moreover, the regularization term is independent of the model parameters.
In this part, we adopt a paradigm of domain adaptation combining multiple sources. As a consequence, in settings 2, 3, and 4, the risk on the source domain is computed by (15), and in setting 4 the IPM regularization term is computed by (16) and (19). In each problem, there are three sources. First, we search the regularization parameter of the single LS-SVM predictor, that is, the balancing parameter of (11), over a range of values on the 20 newsgroups datasets. We can see from Figure 1 that the proposed method tends to achieve the best testing accuracy and low standard deviations. On all datasets and for any value of the parameter, setting 1 has the lowest testing accuracy and a relatively high standard deviation, due to insufficient training with small amounts of labeled data. Since one value gives the best performance in most cases, we fix the parameter to this value in the following experiments.
All results on the datasets listed in Table 2 are shown in Table 4. We can see that in most cases the proposed algorithm outperforms the other methods from a statistical perspective. Setting 1 has the worst accuracy, which means that training on small amounts of target data is not sufficient. The accuracy in setting 1 increases as more labeled data become available, which fits the experience of ERM learning. The performance of setting 2 appears to be even slightly better than that of setting 3 in most cases; thus simply combining the risks over the source and target domains may not work in practice. On the other hand, the IPM regularization term does bridge this gap.
6. Conclusion
In this paper, we proposed a general framework for regularized domain adaptation combining source(s) and target data. The regularization mainly considers the gap between the source domain and the target domain and uses the integral probability metric as the distance measure between domains. A square approximation and the inner product in an RKHS are used for empirical estimation of the IPM. According to existing theoretical work, the IPM regularization term is expected to reduce the generalization error. The regularization method works for domain adaptation combining a single source as well as multiple sources, and a variety of popular predictors can be used. Experiments on regression and classification indicate that this method can work better than the original domain adaptation without the regularization term.
We are also interested in the relationship between semisupervised learning and domain adaptation with few labeled target domain samples, since they share similar problem settings. For cases in which labeled target data are unavailable, pseudolabels may help. Theoretical analysis and empirical results on these questions will be investigated in future work.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
This research was supported by the National Technology Research and Development Program of China (863 Program) 2012AA01A510.
V. N. Vapnik, Statistical Learning Theory, Wiley, 1998.
P. Wu and T. G. Dietterich, “Improving SVM accuracy by training on auxiliary data sources,” in Proceedings of the 21st International Conference on Machine Learning (ICML '04), pp. 871–878, ACM, July 2004.
J. Blitzer, R. McDonald, and F. Pereira, “Domain adaptation with structural correspondence learning,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '06), pp. 120–128, Association for Computational Linguistics, July 2006.
Y. Mansour, M. Mohri, and A. Rostamizadeh, “Domain adaptation with multiple sources,” in Proceedings of the 22nd Annual Conference on Neural Information Processing Systems (NIPS '08), pp. 1041–1048, December 2009.
J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. Wortman, “Learning bounds for domain adaptation,” in Advances in Neural Information Processing Systems, pp. 129–136, 2008.
C. Zhang, L. Zhang, and J. Ye, “Generalization bounds for domain adaptation,” in Advances in Neural Information Processing Systems, pp. 3320–3328, MIT Press, 2012.
S. Ben-David, J. Blitzer, K. Crammer et al., “Analysis of representations for domain adaptation,” in Advances in Neural Information Processing Systems, vol. 19, p. 137, 2007.
D. Kifer, S. Ben-David, and J. Gehrke, “Detecting change in data streams,” in Proceedings of the 30th International Conference on Very Large Data Bases, vol. 30, pp. 180–191, VLDB Endowment, 2004.
L. Duan, I. W. Tsang, D. Xu, and T.-S. Chua, “Domain adaptation from multiple sources via auxiliary classifiers,” in Proceedings of the 26th International Conference on Machine Learning (ICML '09), pp. 289–296, ACM, June 2009.
Y. Yang and C. G. Chute, “A linear least squares fit mapping method for information retrieval from natural language texts,” in Proceedings of the 14th Conference on Computational Linguistics, vol. 2, pp. 447–453, 1992.
R. Rifkin, G. Yeo, and T. Poggio, “Regularized least-squares classification,” in NATO Science Series, III: Computer and Systems Sciences, vol. 190, pp. 131–154, IOS Press, 2003.