
Abstract and Applied Analysis

Volume 2013 (2013), Article ID 715275, 7 pages

http://dx.doi.org/10.1155/2013/715275

## Least Square Regularized Regression for Multitask Learning

^{1}Department of Mathematics, Beijing University of Chemical Technology, Beijing 100029, China

^{2}Department of Mathematics, Beijing University of Aeronautics and Astronautics, Beijing 100091, China

^{3}Department of Systems Engineering and Engineering Management, City University of Hong Kong, Hong Kong

Received 11 October 2013; Accepted 13 November 2013

Academic Editor: Yiming Ying

Copyright © 2013 Yong-Li Xu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Multitask learning algorithms are an important topic of study. This paper proposes a least-square regularized regression algorithm for multitask learning whose hypothesis space is the union of a sequence of Hilbert spaces. The algorithm consists of two steps: selecting the optimal Hilbert space and searching for the optimal function. We assume that the distributions of different tasks are related by a set of transformations under which any Hilbert space in the hypothesis space is norm invariant. We prove that under this assumption the optimal prediction function of every task is in the same Hilbert space. Based on this result, a pivotal error decomposition is established, which can use samples of related tasks to bound the excess error of the target task. We obtain an upper bound for the sample error of related tasks, and based on this bound, potentially faster learning rates are obtained compared to single-task learning algorithms.

#### 1. Introduction

Multitask learning [1] is a learning paradigm which seeks to improve the generalization performance of a learning task with the help of some other related tasks. This learning paradigm is inspired by human learning activities in that people often apply the knowledge gained from previous learning tasks to help learn a new task [2]. Different multitask learning algorithms have been designed, such as multitask support vector machine (SVM) [3], multitask feature learning [4, 5], multitask clustering approach [6], multitask structure learning [7], and multitask gradients learning [8].

Multitask learning can be formulated under two different settings: symmetric and asymmetric [9]. Symmetric multitask learning tries to improve the performance of all tasks simultaneously, whereas asymmetric multitask learning tries to improve the performance of some target task using information from related tasks. Asymmetric multitask learning is related to transfer learning [10]. The major difference is that the source tasks are still learned simultaneously in asymmetric multitask learning while they are learned independently in transfer learning [2]. Much experimental work has succeeded in improving the prediction performance of a learning task with the help of other related tasks [1, 11–15]. However, there has been relatively little progress on the theoretical analysis of these results.

Baxter presented a general framework for model selection in a multitask learning environment [16]. He showed that a hypothesis space that performs well on a sufficiently large number of training tasks will also perform well when learning novel tasks in the same environment. He proved that learning multiple tasks within an environment of related tasks can potentially give much better generalization than learning a single task. Ando and Zhang considered learning predictive structures on hypothesis spaces from multitask learning [17]. They presented a general framework in which the structural learning problem can be formulated and analyzed theoretically and related it to learning with unlabeled data. Ben-David and Borbely defined relatedness of tasks on the basis of similarity between the example-generating distributions for classification [18], and then they gave precise conditions under which bounds guarantee generalization on the basis of smaller sample sizes than the standard single-task approach. Solnon et al. studied multitask regression using penalization techniques [19]. They showed that the key element for an optimal calibration is the covariance matrix of the noise between the different tasks. They presented a new algorithm to estimate this covariance matrix and proved that this estimator converges towards the covariance matrix.

In this paper, we propose a least-square regularized regression algorithm for multitask learning whose hypothesis space is the union of a sequence of Hilbert spaces. The relatedness of tasks is described by the distributions that underlie these tasks and a property of the hypothesis space. We assume that the distributions are related by a set of transformations under which the norm of any Hilbert space in the hypothesis space is invariant. We design a multitask learning algorithm with two steps: first, samples of the other related tasks are used to select an approximately optimal Hilbert space in the hypothesis space; second, in the selected Hilbert space, we use the standard least-square regularized regression algorithm for the target task. It is proved that, under the above assumption, the optimal prediction function of every task is in the same Hilbert space. For the error analysis, we decompose the excess error of the prediction function for the target task into regularization error and sample error, in which the difference between the error and the empirical error of the prediction function in the target task is estimated by the average of those in the related tasks. This leads to a potentially faster learning rate than that of the standard regularized regression algorithm for a single task.

The rest of the paper is organized as follows. In Section 2, we introduce some notions and definitions and then propose the least-square regularized regression for multitask learning. In Section 3, we decompose the excess error of the target task into regularization error and sample error, in which samples of other related tasks can be used to estimate the difference between the error and the empirical error of the prediction function in the target task. The main result is presented in Section 4. An upper bound for the sample error of multiple tasks is given by Hoeffding’s inequality and an estimate of the covering number in multitask learning; based on this upper bound, potentially faster learning rates compared to single-task learning algorithms are obtained.

#### 2. Preliminaries

To propose the regularized regression algorithm for multitask learning, we introduce some definitions and notations. Let $X$ be a compact metric space and let $Y = \mathbb{R}$. Let $\rho$ be a probability distribution on $X \times Y$. The regression function is defined as

$$f_\rho(x) = \int_Y y \, d\rho(y \mid x), \quad x \in X,$$

where $\rho(\cdot \mid x)$ is the conditional probability measure at $x$ induced by $\rho$. Knowing a set of samples $\mathbf{z} = \{(x_j, y_j)\}_{j=1}^{m}$ from the probability distribution $\rho$, our goal is to find a good approximation of $f_\rho$.

For multitask learning, we define the relatedness of the probability distributions of multiple tasks.

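*Definition 1. *For a function $\tau : X \to X$, let $\tau[\rho]$ be the probability distribution over $X \times Y$ defined by $\tau[\rho](S) = \rho(\{(\tau(x), y) : (x, y) \in S\})$, for $S \subseteq X \times Y$. Let $\mathcal{T}$ be a set of transformations $\tau : X \to X$ and let $\rho_1, \rho_2$ be probability distributions over $X \times Y$. We say that $\rho_1$ and $\rho_2$ are $\mathcal{T}$-related if there exists some $\tau \in \mathcal{T}$ such that $\rho_1 = \tau[\rho_2]$ or $\rho_2 = \tau[\rho_1]$.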

Then, we describe the relatedness of the transformation set above and a hypothesis space.

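*Definition 2. *Let $\mathcal{T}$ be a set of transformations $\tau : X \to X$ and let $\mathcal{H}$ be a set of functions $f : X \to Y$. We say that $\mathcal{T}$ acts as a group over $\mathcal{H}$ if (1) for every $\tau \in \mathcal{T}$ and every $f \in \mathcal{H}$, there holds $f \circ \tau \in \mathcal{H}$; (2) for every $\tau_1, \tau_2 \in \mathcal{T}$, the inverse transformation $\tau_1^{-1}$ and the composition $\tau_1 \circ \tau_2$ are also members of $\mathcal{T}$.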

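*Definition 3. *Let $\mathcal{H}$ be a Hilbert space with norm $\|\cdot\|_{\mathcal{H}}$ and let $\mathcal{T}$ act as a group over $\mathcal{H}$. We say $\mathcal{H}$ is norm invariant under $\mathcal{T}$ if, for any $\tau \in \mathcal{T}$ and any $f \in \mathcal{H}$, there holds

$$\|f \circ \tau\|_{\mathcal{H}} = \|f\|_{\mathcal{H}}.$$

To explain the above definitions, we give an example.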

*Example 4. *Let , , , and , for , and be the closure of linear span of the set with inner product , for any . Let the norm of functions in be induced by the inner product.

Assume is a set of translation and rotation transformations on : for any and , there holds , for some , or , for
where is an angle dependent on . We can verify that acts as a group over and is norm invariant under , for any .
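The invariance claimed in this example can be checked numerically. The sketch below assumes, purely for illustration, a Gaussian kernel $\exp(-|x - t|^2/\sigma^2)$ on $\mathbb{R}^2$ and a rigid motion $\tau(x) = Rx + b$; the kernel choice, the helper names (`gram`, `rkhs_norm_sq`, `evaluate`), and all numerical values are assumptions rather than the paper's own construction. For a finite kernel expansion $f = \sum_i c_i K(x_i, \cdot)$, the composition $f \circ \tau$ is again a kernel expansion with centers $\tau^{-1}(x_i)$, and its squared norm $c^{\top} K c$ is unchanged because the kernel depends only on distances.

```python
import numpy as np

def gram(centers_a, centers_b, sigma):
    """Gaussian Gram matrix with entries exp(-|a - b|^2 / sigma^2)."""
    d2 = ((centers_a[:, None, :] - centers_b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / sigma**2)

def rkhs_norm_sq(coeffs, centers, sigma):
    """Squared RKHS norm of f = sum_i c_i K(x_i, .), i.e. c^T K c."""
    return float(coeffs @ gram(centers, centers, sigma) @ coeffs)

def evaluate(coeffs, centers, sigma, points):
    """Values of f = sum_i c_i K(x_i, .) at the given points."""
    return gram(points, centers, sigma) @ coeffs

rng = np.random.default_rng(0)
sigma, theta = 1.3, 0.7                      # illustrative values only
coeffs = rng.normal(size=5)
centers = rng.normal(size=(5, 2))
shift = rng.normal(size=2)
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])

def tau(points):
    """Rigid motion tau(x) = R x + b applied to each row of `points`."""
    return points @ rot.T + shift

def tau_inverse(points):
    """Inverse motion tau^{-1}(x) = R^T (x - b) applied to each row."""
    return (points - shift) @ rot

# f o tau is a kernel expansion whose centers are moved by tau^{-1}.
moved_centers = tau_inverse(centers)
probe = rng.normal(size=(4, 2))
values_composed = evaluate(coeffs, centers, sigma, tau(probe))
values_moved = evaluate(coeffs, moved_centers, sigma, probe)

norm_f = rkhs_norm_sq(coeffs, centers, sigma)
norm_f_tau = rkhs_norm_sq(coeffs, moved_centers, sigma)
print(np.allclose(values_composed, values_moved), np.isclose(norm_f, norm_f_tau))
```

The same check would fail for a kernel that is not a function of $|x - t|$ alone, which is why the choice of transformation set and hypothesis space must be made together.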

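Now we introduce the standard least-square regularized regression algorithm. We denote the error and the empirical error of a function with squared loss as follows. For a distribution $\rho$ on $X \times Y$, the error of $f$ is defined as

$$\mathcal{E}(f) = \int_{X \times Y} (f(x) - y)^2 \, d\rho.$$

It is well known that the regression function $f_\rho$ minimizes the error. Indeed,

$$\mathcal{E}(f) - \mathcal{E}(f_\rho) = \|f - f_\rho\|_{L^2_{\rho_X}}^2,$$

where $\rho_X$ is the marginal distribution of $\rho$ on $X$ and $\|f\|_{L^2_{\rho_X}} = \bigl(\int_X |f(x)|^2 \, d\rho_X\bigr)^{1/2}$. The above difference is called the excess error of $f$. For a sample set $\mathbf{z} = \{(x_j, y_j)\}_{j=1}^{m}$, independently drawn according to $\rho$, the empirical error of $f$ is defined as

$$\mathcal{E}_{\mathbf{z}}(f) = \frac{1}{m} \sum_{j=1}^{m} (f(x_j) - y_j)^2.$$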
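The excess-error identity follows from expanding the square around the regression function. The derivation below is a standard one, written with assumed notation: $\mathcal{E}(f) = \int_{X \times Y} (f(x) - y)^2 \, d\rho$ for the error, $f_\rho(x) = \int_Y y \, d\rho(y \mid x)$ for the regression function, and $\rho_X$ for the marginal distribution on $X$.

```latex
\begin{aligned}
\mathcal{E}(f)
  &= \int_{X \times Y} \bigl(f(x) - f_\rho(x) + f_\rho(x) - y\bigr)^2 \, d\rho \\
  &= \int_X \bigl(f(x) - f_\rho(x)\bigr)^2 \, d\rho_X
   + 2 \int_X \bigl(f(x) - f_\rho(x)\bigr)
       \left( \int_Y \bigl(f_\rho(x) - y\bigr) \, d\rho(y \mid x) \right) d\rho_X
   + \mathcal{E}(f_\rho) \\
  &= \|f - f_\rho\|_{L^2_{\rho_X}}^2 + \mathcal{E}(f_\rho).
\end{aligned}
```

The cross term vanishes because $\int_Y (f_\rho(x) - y) \, d\rho(y \mid x) = 0$ for every $x$, by the definition of $f_\rho$.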

In this paper, we consider a sequence of related learning tasks. The goal is to use information of related tasks to improve learning performance of one special learning task. Let be a transformation set and be a Hilbert space with norm , for any , where is an index set. We assume that there is , such that

Let denote the probability distribution of the th task, for . We assume is pair-wise -related, acts as a group over , and is norm invariant under , for any . Let be samples independently drawn according to . Since the tasks are related, we try to use samples of all the tasks to improve the learning performance of the target task.

Standard least-square regularized regression algorithm associated with for the th single task is defined as the minimizer In the above optimization, there are two steps; firstly, for any fixed , find the optimal function ; secondly, find the global optimal function , for . Since the tasks are related, we try to use samples of all the tasks to improve the learning performance of the target task. Without loss of generality, we choose the first task as the target task.

Now, we propose the least-square regularized regression algorithm for multitask learning.

*Step 1. *Use samples of other tasks to select the approximate optimal as follows:

*Step 2. *In , search for the approximation to as follows:
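The two steps can be sketched in code under explicit assumptions. The sketch takes the Hilbert spaces to be Gaussian reproducing kernel Hilbert spaces indexed by a finite list of widths `sigmas`, and it assumes that Step 1 selects the width minimizing the average regularized empirical objective over the related tasks; both choices, and every function name below, are illustrative assumptions and not the paper's exact formulas. Step 2 is ordinary kernel ridge regression on the target task in the selected space.

```python
import numpy as np

def gaussian_gram(a, b, sigma):
    """Gram matrix of the Gaussian kernel exp(-|a - b|^2 / sigma^2)."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / sigma**2)

def fit_krr(x, y, sigma, lam):
    """Minimize (1/m) sum_j (f(x_j) - y_j)^2 + lam * ||f||^2 over the RKHS.

    Returns the coefficients alpha of f = sum_j alpha_j K(x_j, .) and the
    minimal value of the regularized empirical objective.
    """
    m = len(y)
    k = gaussian_gram(x, x, sigma)
    alpha = np.linalg.solve(k + lam * m * np.eye(m), y)
    residual = k @ alpha - y
    objective = residual @ residual / m + lam * alpha @ k @ alpha
    return alpha, float(objective)

def select_sigma(related_tasks, sigmas, lam):
    """Step 1 (assumed form): smallest average objective over related tasks."""
    scores = [np.mean([fit_krr(x, y, s, lam)[1] for x, y in related_tasks])
              for s in sigmas]
    return sigmas[int(np.argmin(scores))]

def multitask_fit(target, related_tasks, sigmas, lam):
    """Step 2: regularized least squares on the target task in the chosen space."""
    sigma = select_sigma(related_tasks, sigmas, lam)
    x, y = target
    alpha, _ = fit_krr(x, y, sigma, lam)

    def predict(points):
        return gaussian_gram(points, x, sigma) @ alpha

    return sigma, predict

# Toy data (hypothetical): each task is a translate of one regression function.
rng = np.random.default_rng(0)

def make_task(shift, m):
    x = rng.uniform(-3, 3, size=(m, 1))
    y = np.sin(2 * (x[:, 0] - shift)) + 0.1 * rng.normal(size=m)
    return x, y

target = make_task(0.0, 15)
related = [make_task(s, 60) for s in (0.5, -1.0, 1.5)]
sigma_hat, predict = multitask_fit(target, related, [0.1, 0.5, 1.0, 4.0], 1e-3)
```

In this sketch the target samples are never used to choose the width, so the space selected in Step 1 is independent of the target sample, which is the property the error analysis relies on.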

#### 3. Error Decomposition

To estimate the bound of excess error , we introduce some notations. Let be the optimal for the *i*th task,

Lemma 5. *Let denote the probability distribution of the th task for , let be a Hilbert space with norm for , and let be a transformation set. Assume is pair-wise -related, acts as a group over , and is norm invariant under , for any . Then defined in (11) satisfies
*

*Proof. *By the assumption that and are -related, it is easy to show that
for any and . Notice that is norm-invariant under for any . Then, there holds
for any . Then the lemma follows.

To decompose the excess error, we give the following notations. Denote

Proposition 6. *Let be defined in (10). Then under the assumption of Lemma 5, can be bounded by
*

*Proof. *Write the regularization error as
By Lemma 5, for any , there holds . Then we obtain
By the definition, we have that
Therefore, we have
Then, the proposition follows.

In (16), there are five terms. The first term is called the regularization error, which depends on how well the hypothesis space can approximate the regression function of the target task. The estimation of this term has been discussed in [20] for reproducing kernel Hilbert spaces with Gaussian kernels with flexible variances.

The other four terms are called sample error. In the second and third terms, and are selected from which is dependent on , for , and is independent of . Therefore, when we take expectation of , can be seen as a fixed function space. Consequently, these two terms can be estimated with the same method in the proof of Propositions 2.1 and 3.1 in [21]. In the fourth term, is a fixed function, for . Therefore, this term can also be estimated as in the proof of Proposition 2.1 in [21]. The last term is more difficult to deal with because , for , can not be considered in , for any fixed . Consequently, when sample number , the convergence rate of the sample error depends on that of the last term. Therefore, in the following section, we focus on the estimation for the bound of the last term in (16).

#### 4. Error Analysis

In this section, we estimate the bound of the last term in (16). To bound this term, we have to estimate capacity of . Here, the capacity is measured by the covering number.

*Definition 7. *For a subset $S$ of a metric space and $\eta > 0$, the covering number $\mathcal{N}(S, \eta)$ is defined to be the minimal integer $l \in \mathbb{N}$ such that there exist $l$ disks with radius $\eta$ covering $S$.
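For a finite set of points, an upper bound on the covering number can be computed greedily. The sketch below is a minimal illustration under the assumption of the Euclidean metric with closed disks centered at points of the set; the function name `greedy_cover` is hypothetical, and the count it returns bounds the covering number from above without necessarily attaining it.

```python
import numpy as np

def greedy_cover(points, eta):
    """Return centers (a subset of `points`) whose eta-disks cover all points.

    The number of centers is an upper bound on the covering number of the
    finite set `points` at radius eta; it need not be the minimal count.
    """
    points = np.asarray(points, dtype=float)
    uncovered = np.ones(len(points), dtype=bool)
    centers = []
    while uncovered.any():
        idx = int(np.flatnonzero(uncovered)[0])   # first point not yet covered
        centers.append(points[idx])
        dist = np.linalg.norm(points - points[idx], axis=1)
        uncovered &= dist > eta                   # keep only points farther than eta
    return np.array(centers)

grid = np.linspace(0.0, 1.0, 101).reshape(-1, 1)  # 101 points in [0, 1]
centers = greedy_cover(grid, eta=0.1)
print(len(centers))
```

Shrinking `eta` increases the number of disks needed, which is how the covering number enters a probability bound through a union over disk centers.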

For and , define For and , define Then, define the distance from to as Let denote the minimal integer such that there exist parameters , such that Then, for , we have the following lemma.

Lemma 8. *Consider the following:
*

*Proof. *For any , there is such that . By the definition of , there is such that . Then, by the definition of , we have . And by the definition of , there is satisfying .

By the definition of , there is such that . Therefore, we can obtain
Note that, for any , there holds . Then the lemma follows.

Proposition 9. *For and , for , defined in (10), there holds
*

*Proof. *By the definition of , we have , for . Then, there holds
For , denote . Note that, for , there holds
Therefore, we can find that balls such that the center of each ball and point in this ball satisfies . Therefore, the probability in (28) can be bounded by following expression:
For the covering number , by Lemma 8, we have the estimate
Using Hoeffding inequality, for any fixed , we have
Then, the proposition follows.
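Hoeffding's inequality, in its standard two-sided form, states that the mean of $m$ independent random variables with values in an interval of length $B$ deviates from its expectation by at least $\varepsilon$ with probability at most $2\exp(-2m\varepsilon^2/B^2)$. The simulation below is a sanity check of that standard statement only, not a reproduction of the bound in Proposition 9: it assumes losses bounded in $[0, 1]$ and uses uniform draws as stand-ins, and all names and values are illustrative.

```python
import numpy as np

def hoeffding_bound(m, eps, width):
    """Two-sided Hoeffding bound for the mean of m variables with range `width`."""
    return 2.0 * np.exp(-2.0 * m * eps**2 / width**2)

rng = np.random.default_rng(0)
m, eps, trials = 200, 0.1, 20000

# Stand-in for bounded squared losses: independent uniform draws on [0, 1].
losses = rng.uniform(0.0, 1.0, size=(trials, m))
deviation = np.abs(losses.mean(axis=1) - 0.5)     # |empirical mean - expectation|

empirical_tail = float(np.mean(deviation >= eps))  # observed frequency of large deviations
bound = float(hoeffding_bound(m, eps, width=1.0))  # 2 * exp(-4) for these values
print(empirical_tail, bound)
```

The bound holds for a single fixed function; handling a whole function class requires combining it with a covering argument, which is where the covering number estimate is used.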

Finally, we can obtain a bound for the last term of (16) by Proposition 9.

Proposition 10. *Let , for , be defined in (10). Assume, for all and , there holds
**
Then with confidence at least , there holds
**
where is the solution of the following equation:
*

*Proof. *Let
By condition on , we have
Then, is not larger than in the following equation:
Then by Proposition 9, the proposition follows.

*Remark 11. *We compare multitask learning with multiple Hilbert spaces to single-task learning with multiple Hilbert spaces.

Recall that the least square regularized regression in for single task is defined as
can be bounded by the sum of regularization error and sample error with similar method in Proposition 6. In the sample error, the term most difficult to estimated will be , because function changed with the runs over function set . By the same method in Proposition 10, with confidence , this term can be bounded by the solution of the following equation:
Obviously, we have . Therefore, multitask learning algorithm has potential faster learning rate.

*Remark 12. *We compare multitask learning with multiple Hilbert spaces to single-task learning with a single Hilbert space.

In this paper, we set the hypothesis space as a set of Hilbert spaces. It is well known that a hypothesis space with more functions has stronger approximation ability and higher complexity. Therefore, the regularization error may be smaller and the sample error may be larger than those of algorithms that use a single Hilbert space as the hypothesis space.

For least square regularized regression with a single Hilbert space for single task, with confidence , the largest term in sample error can be bounded by the solution of the following equation:

If we assume with some large enough, in Proposition 10 can converge to fast. Then, we can obtain , while the regularization error of multitask learning with multiple Hilbert spaces is smaller. Trading off the regularization error and sample error, we can obtain potential faster learning rate than that of single task learning.

#### Acknowledgments

This work was supported in part by the Fundamental Research Funds for the Central Universities under Grant ZY1118, National Technology Support Program under Grant 2012BAH05B01, and the National Science Foundation of China under Grants 11101024 and 11171014.

#### References

1. R. Caruana, “Multitask learning,” *Machine Learning*, vol. 28, no. 1, pp. 41–75, 1997.
2. Y. Zhang and D.-Y. Yeung, “A convex formulation for learning task relationships in multi-task learning,” in *Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence (UAI '10)*, pp. 733–742, Catalina Island, Calif, USA, July 2010.
3. T. Evgeniou and M. Pontil, “Regularized multi-task learning,” in *Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining*, pp. 109–117, New York, NY, USA, August 2004.
4. A. Argyriou, T. Evgeniou, and M. Pontil, “Convex multi-task feature learning,” *Machine Learning*, vol. 73, no. 3, pp. 243–272, 2008.
5. J. Liu, S. Ji, and J. Ye, “Multi-task feature learning via efficient L2,1-norm minimization,” in *Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (UAI '09)*, pp. 339–348, Montreal, Canada, June 2009.
6. B. Bakker and T. Heskes, “Task clustering and gating for Bayesian multitask learning,” *Journal of Machine Learning Research*, vol. 4, no. 1, pp. 83–99, 2004.
7. A. Argyriou, C. A. Micchelli, M. Pontil, and Y. M. Ying, “A spectral regularization framework for multi-task structure learning,” in *Proceedings of the 21st Annual Conference on Advances in Neural Information Processing Systems (NIPS '07)*, December 2007.
8. J. Guinney, Q. Wu, and S. Mukherjee, “Estimating variable structure and dependence in multitask learning via gradients,” *Machine Learning*, vol. 83, no. 3, pp. 265–287, 2011.
9. Y. Xue, X. Liao, L. Carin, and B. Krishnapuram, “Multi-task learning for classification with Dirichlet process priors,” *Journal of Machine Learning Research*, vol. 8, pp. 35–63, 2007.
10. S. J. Pan and Q. Yang, “A survey on transfer learning,” *IEEE Transactions on Knowledge and Data Engineering*, vol. 22, no. 10, pp. 1345–1359, 2010.
11. J. Baxter, “Learning internal representations,” in *Proceedings of the Workshop on Computational Learning Theory (COLT '95)*, Morgan Kaufmann, San Mateo, Calif, USA, 1995.
12. N. Intrator and S. Edelman, “Making a low-dimensional representation suitable for diverse tasks,” *Connection Science*, vol. 8, no. 2, pp. 205–224, 1996.
13. S. Thrun, “Is learning the n-th thing any easier than learning the first?” in *Proceedings of the Advances in Neural Information Processing Systems (NIPS '96)*, D. Touretzky and M. Mozer, Eds., 1996.
14. T. Heskes, “Solving a huge number of similar tasks: a combination of multi-task learning and a hierarchical Bayesian approach,” in *Proceedings of the International Conference on Machine Learning (ICML '98)*, 1998.
15. B. Romera-Paredes, A. Argyriou, N. Berthouze, and M. Pontil, “Exploiting unrelated tasks in multi-task learning,” in *Proceedings of the 15th International Conference on Artificial Intelligence and Statistics (AISTATS '12)*, pp. 951–959, La Palma, Spain, 2012.
16. J. Baxter, “A model of inductive bias learning,” *Journal of Artificial Intelligence Research*, vol. 12, pp. 149–198, 2000.
17. R. K. Ando and T. Zhang, “A framework for learning predictive structures from multiple tasks and unlabeled data,” *Journal of Machine Learning Research*, vol. 6, pp. 1817–1853, 2005.
18. S. Ben-David and R. S. Borbely, “A notion of task relatedness yielding provable multiple-task learning guarantees,” *Machine Learning*, vol. 73, no. 3, pp. 273–287, 2008.
19. M. Solnon, S. Arlot, and F. Bach, “Multi-task regression using minimal penalties,” *Journal of Machine Learning Research*, vol. 13, pp. 2773–2812, 2012.
20. Y. M. Ying and D.-X. Zhou, “Learnability of Gaussians with flexible variances,” *Journal of Machine Learning Research*, vol. 8, pp. 249–276, 2007.
21. Q. Wu, Y. Ying, and D.-X. Zhou, “Learning rates of least-square regularized regression,” *Foundations of Computational Mathematics*, vol. 6, no. 2, pp. 171–192, 2006.