Instance Transfer Learning with Multisource Dynamic TrAdaBoost
Because transfer learning can employ knowledge from related domains to help the learning task in the current target domain, it reduces learning cost and improves learning efficiency compared with traditional learning. Focusing on the situation in which sample data from the source domains and the target domain have similar distributions, this paper proposes an instance transfer learning method based on multisource dynamic TrAdaBoost. In this method, knowledge from multiple source domains is used to avoid negative transfer; furthermore, the information that is conducive to target task learning is used to train candidate classifiers. The theoretical analysis suggests that, by adding a dynamic factor, the proposed algorithm counteracts the drift of weight entropy from source to target instances and that its classification effectiveness is better than that of single-source transfer. Finally, experimental results show that the proposed algorithm achieves higher classification accuracy.
1. Introduction
In data mining, traditional machine learning generally assumes that training data and test data have the same distribution. In practical applications, however, this assumption often cannot be met. By transferring and sharing knowledge from different fields for target task learning, transfer learning turns traditional learning from scratch into cumulative learning, which can improve learning efficiency and reduce learning cost [2, 3]. In 2005, the Information Processing Techniques Office (IPTO) gave a new mission of transfer learning: the ability of a system to recognize and apply knowledge and skills learned in previous tasks to novel tasks. Under this definition, transfer learning aims to extract knowledge from one or more source tasks and apply it to a target task. Because transfer learning uses information from similar domains and tasks, its effectiveness depends on the correlation between the source and target domains.
However, transfer learning is more complex than traditional machine learning because of the transfer step itself. Knowledge in related domains can be represented in many ways, such as sample instances, feature mappings, model parameters, and association rules. Because it is simple to implement, this paper selects sample instances as the knowledge representation for designing an effective transfer algorithm. In detail, instance transfer learning improves classification accuracy by finding training samples in other source domains that are strongly correlated with the target domain and reusing them in the learning of the target task. How the weights of these training data are decided therefore influences the effectiveness of the candidate classifiers.
Researchers have proposed several approaches to solve transfer learning problems. Ben and Schuller provided a theoretical justification for multitask learning. Daumé and Marcu studied the domain-transfer problem in statistical natural language processing by using a specific Gaussian model. Wu and Dietterich proposed an image classification algorithm that uses both inadequate training data and plenty of low-quality auxiliary data. This algorithm shows some improvement from the auxiliary data, but it does not provide a quantitative study using different auxiliary examples. Liao et al. proposed an active learning method that selects unlabeled data in a target domain to be labeled with the help of the source domain data. Rosenstein et al. proposed a hierarchical naive Bayes approach for transfer learning with auxiliary data and discussed the question of when transfer learning should be applied.
The transfer AdaBoost algorithm, also called TrAdaBoost, is a classic transfer learning algorithm proposed by Dai et al. TrAdaBoost assumes that the source and target domain data use exactly the same set of features and labels but that the distributions of the data in the two domains differ. In addition, TrAdaBoost assumes that, because of this difference in distributions, some of the source domain data may be useful for learning in the target domain, whereas some may not be and could even be harmful. Since TrAdaBoost relies on only one source, its learning performance becomes poor when the correlation between the source and target domains is weak. Moreover, as reported in [12–14], TrAdaBoost suffers from weight mismatch, introduced imbalance, and rapid convergence of source weights. The purpose of this paper is to remove the weight drift phenomenon efficiently, improve learning efficiency, and inhibit negative transfer.
2. Multisource Dynamic TrAdaBoost Algorithm
Considering the correlation between multiple source domains and the target domain, Yao and Doretto proposed the multisource TrAdaBoost (MSTrA) transfer learning algorithm. As an instance-based transfer learning method, MSTrA selects its training samples from different source domains. At each iteration, MSTrA selects the most closely related source domain to train the weak classifier. Although this ensures that the transferred knowledge is relevant to the target task, MSTrA ignores the effects of the other source domains. Samir and Chandan proposed an algorithm (DTrAdaBoost) with an integrated dynamic cost to resolve a major issue in the boosting-based transfer algorithm TrAdaBoost: source instance weights converge before they can be used for transfer learning. However, DTrAdaBoost has low learning efficiency.
To overcome the above disadvantages, a multisource dynamic TrAdaBoost algorithm (MSDTrA) is proposed. With this algorithm, the convergence rate of the source sample weights is reduced according to their weak correlation with the target domain. Suppose that several source domains are available, each with its own source task and source training data; the purpose of transfer learning is to make good use of them to improve the learning effectiveness of the target classifier function. In detail, the steps of MSDTrA are as follows.
Step 1. Initialize the weight vector, which consists of one weight vector for the training samples of each source domain and one weight vector for the training samples of the target domain.
Step 2. Set the weight-update parameter for the source samples, which depends on the total number of training samples over all source domains (the sum of the sample counts of the individual source training sets).
Step 3. Empty the set of candidate weak classifiers and normalize the weight vector so that it sums to 1.
Step 4. Select a base learner and, for each source domain in turn, train a candidate weak classifier on the combined training set of that source domain and the target domain. Calculate the error rate of this candidate weak classifier on the target training set, where the error counts the target training samples that the classifier misclassifies, and then update the weight of the classifier from that error by using the vector update strategy. Repeat until all source domains have been traversed. In this way, the error of each weak classifier on the target training set is computed and a weight is assigned to each weak classifier according to its error: the larger the error, the smaller the weight. In other words, source domains whose classifiers achieve high classification accuracy contain more valuable information for learning the target task.
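The weighting idea in Step 4 can be illustrated with a minimal Python sketch. The exponential form exp(1 - error) / exp(error) used below is an assumption borrowed from weighted multisource boosting schemes, not a formula confirmed by this text; the function names are hypothetical. Any monotonically decreasing map from error to weight would show the same behavior.

```python
import math

def classifier_weight(error):
    """Weight of a candidate weak classifier given its target error in [0, 1].

    Assumed form: exp(1 - error) / exp(error) = exp(1 - 2 * error),
    which decreases monotonically as the error grows.
    """
    if not 0.0 <= error <= 1.0:
        raise ValueError("error must lie in [0, 1]")
    return math.exp(1.0 - 2.0 * error)

def weighted_error(weights, predictions, labels):
    """Weighted misclassification rate on the target training set."""
    total = sum(weights)
    wrong = sum(w for w, p, y in zip(weights, predictions, labels) if p != y)
    return wrong / total

def combine(candidate_predictions, candidate_errors):
    """Blend per-source candidate predictions (labels in {0, 1}) into one
    score per sample, using classifier weights normalized to sum to 1."""
    raw = [classifier_weight(e) for e in candidate_errors]
    norm = [r / sum(raw) for r in raw]
    n_samples = len(candidate_predictions[0])
    return [sum(a * preds[i] for a, preds in zip(norm, candidate_predictions))
            for i in range(n_samples)]
```

With two candidates whose target errors are 0.1 and 0.4, the first receives the larger share of the vote, so a source domain that predicts the target well dominates the combined candidate classifier.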
Step 5. Integrate all weighted weak classifiers to obtain a candidate classifier at the current iteration, and compute its classification error on the target training set; this error must be less than 0.5. Then calculate the errors of the candidate classifier on the source and target training sets, and update the weights of the training samples in the source and target domains accordingly. For correctly classified source training samples, the corresponding weights remain unchanged.
Step 6. Set the weight-update parameter for the target samples and the dynamic factor. Theorem 1 provides the derivation of the dynamic factor.
Step 7. Update the weight vector of the source samples. This update uses the weighted majority algorithm (WMA) mechanism.
Step 8. Update the weights of the target samples by using the parameter calculated in Step 6.
Step 9. Decide the final strong classifier.
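For reference, the quantities named in the update steps above can be written as follows under the conventions of standard TrAdaBoost and its dynamic variant. This is a hedged sketch: every symbol here is an assumption introduced for illustration (N source domains, n_{S_k} samples in source k, n_T target samples, M boosting iterations, h_t the candidate classifier at iteration t), and the exact expressions used by the authors may differ.

```latex
\begin{align*}
\beta_S &= \frac{1}{1 + \sqrt{2 \ln n_S / M}}, \qquad n_S = \sum_{k=1}^{N} n_{S_k}, \\
\varepsilon_t &= \frac{\sum_{i=1}^{n_T} w^{T}_{t,i}\, \mathbf{1}[h_t(x_i) \neq y_i]}
                      {\sum_{i=1}^{n_T} w^{T}_{t,i}}, \qquad \varepsilon_t < 0.5, \\
\beta_t &= \frac{\varepsilon_t}{1 - \varepsilon_t}, \qquad C_t = 2(1 - \varepsilon_t), \\
w^{S_k}_{t+1,i} &= C_t\, w^{S_k}_{t,i}\, \beta_S^{\mathbf{1}[h_t(x_i) \neq y_i]}, \\
w^{T}_{t+1,i} &= w^{T}_{t,i}\, \beta_t^{-\mathbf{1}[h_t(x_i) \neq y_i]}.
\end{align*}
```

Under these assumptions, a correctly classified source sample is only rescaled by the dynamic factor C_t, a misclassified source sample is additionally discounted by the fixed factor \beta_S < 1, and a misclassified target sample is boosted by 1/\beta_t > 1.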
In the MSDTrA algorithm, the ensemble learning scheme of TrAdaBoost is used to train classifiers on the combined set of source and target instances at every step. WMA adjusts the weights of the source set by decreasing the weights of misclassified source instances and preserving the current weights of correctly classified source instances.
The above algorithm shows that MSDTrA allows all source training samples to participate in the learning process at each iteration, with different source training samples assigned different weights. If a source training sample can improve the learning of the target task, it is assigned a large weight. Overall, MSDTrA takes full advantage of the useful knowledge from all source domains, which enhances the learning effectiveness of the target task.
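The overall loop can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the base learner is a one-dimensional decision stump instead of an SVM, the classifier weight exp(1 - 2 * error), the source discount, the dynamic factor 2 * (1 - error), and the final weighted vote follow standard TrAdaBoost-style conventions, and all function names (`msdtra_fit`, `msdtra_predict`, and the helpers) are hypothetical.

```python
import math

def stump_predict(stump, x):
    """Threshold classifier on a scalar feature; returns a label in {0, 1}."""
    threshold, polarity = stump
    return 1 if polarity * (x - threshold) >= 0 else 0

def stump_train(xs, ys, ws):
    """Pick the (threshold, polarity) pair with the lowest weighted error."""
    best_err, best = float("inf"), None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            stump = (threshold, polarity)
            err = sum(w for x, y, w in zip(xs, ys, ws)
                      if stump_predict(stump, x) != y)
            if err < best_err:
                best_err, best = err, stump
    return best

def vote(candidates, x):
    """Weighted vote of the per-source candidate classifiers."""
    score = sum(a * stump_predict(s, x) for a, s in candidates)
    return 1 if score >= 0.5 else 0

def weighted_error(predict, xs, ys, ws):
    return sum(w for x, y, w in zip(xs, ys, ws) if predict(x) != y) / sum(ws)

def msdtra_fit(sources, target, n_rounds=10):
    """sources: list of (xs, ys) lists, one pair per source domain.
    target: (xs, ys) lists. Labels are 0 or 1."""
    tx, ty = target
    n_src = sum(len(xs) for xs, _ in sources)
    beta_s = 1.0 / (1.0 + math.sqrt(2.0 * math.log(n_src) / n_rounds))
    w_src = [[1.0] * len(xs) for xs, _ in sources]
    w_tgt = [1.0] * len(tx)
    ensemble = []
    for _ in range(n_rounds):
        # Normalize all weights so that they sum to 1.
        total = sum(sum(ws) for ws in w_src) + sum(w_tgt)
        w_src = [[w / total for w in ws] for ws in w_src]
        w_tgt = [w / total for w in w_tgt]
        # One candidate per source, trained on that source plus the target.
        raw = []
        for (sx, sy), ws in zip(sources, w_src):
            stump = stump_train(sx + tx, sy + ty, ws + w_tgt)
            err = weighted_error(lambda x: stump_predict(stump, x), tx, ty, w_tgt)
            raw.append((math.exp(1.0 - 2.0 * err), stump))
        norm = sum(a for a, _ in raw)
        candidates = [(a / norm, s) for a, s in raw]
        # Error of the combined candidate on the target set.
        eps = weighted_error(lambda x: vote(candidates, x), tx, ty, w_tgt)
        eps = min(max(eps, 1e-10), 0.499)  # keep beta_t finite and eps < 0.5
        beta_t, c_t = eps / (1.0 - eps), 2.0 * (1.0 - eps)
        # WMA update for the sources, AdaBoost-style update for the target.
        w_src = [[c_t * w * (beta_s if vote(candidates, x) != y else 1.0)
                  for x, y, w in zip(sx, sy, ws)]
                 for (sx, sy), ws in zip(sources, w_src)]
        w_tgt = [w / beta_t if vote(candidates, x) != y else w
                 for x, y, w in zip(tx, ty, w_tgt)]
        ensemble.append((math.log(1.0 / beta_t), candidates))
    return ensemble

def msdtra_predict(ensemble, x):
    """Final classifier (assumed form): weighted majority over all rounds."""
    score = sum(alpha * vote(c, x) for alpha, c in ensemble)
    return 1 if score >= 0.5 * sum(alpha for alpha, _ in ensemble) else 0
```

A toy call with one source that agrees with the target and one whose labels are inverted, for example `msdtra_fit([good, bad], target, n_rounds=5)`, lets the agreeing source dominate the vote while the inverted source is progressively down-weighted.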
3. Theoretical Analysis
The previous section introduced the proposed instance transfer learning algorithm in detail. In this section, the related theoretical analysis is given, following that of the single-source TrAdaBoost algorithm. First, Theorems 1 and 2 establish how the dynamic factor applied to the source weights affects the source and target sample weight vectors, respectively.
Theorem 1. A dynamic factor applied to the source weights prevents their weight drift and yields the weight update mechanism of the source samples.
Proof. Consider the sum of the correctly classified target weights and the sum of the misclassified target weights at a boosting iteration. Substituting these two sums into the source update of TrAdaBoost and simplifying gives the updated source weight. Introducing the correction factor into the WMA then gives the expression of the dynamic factor.
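The argument of Theorem 1 can be sketched as a short derivation. The notation is assumed for illustration and follows the single-source dynamic TrAdaBoost analysis: n source instances, each with weight w_S^t and all correctly classified; m target instances, each with weight w_T^t; target error \varepsilon_t; \beta_t = \varepsilon_t / (1 - \varepsilon_t); and normalized weights, so that n w_S^t + m w_T^t = 1. A and B denote the sums of correctly classified and misclassified target weights after the update.

```latex
\begin{align*}
A &= m\, w_T^{t}\,(1 - \varepsilon_t), \qquad
B = m\, w_T^{t}\, \varepsilon_t\, \beta_t^{-1} = m\, w_T^{t}\,(1 - \varepsilon_t), \\
w_S^{t+1} &= \frac{w_S^{t}}{n\, w_S^{t} + A + B}
           = \frac{w_S^{t}}{n\, w_S^{t} + 2 m\, w_T^{t}\,(1 - \varepsilon_t)}, \\
w_S^{t} &= \frac{C_t\, w_S^{t}}{C_t\, n\, w_S^{t} + 2 m\, w_T^{t}\,(1 - \varepsilon_t)}
\;\Longrightarrow\;
C_t = \frac{2 m\, w_T^{t}\,(1 - \varepsilon_t)}{1 - n\, w_S^{t}} = 2(1 - \varepsilon_t).
\end{align*}
```

Without a correction, the denominator in the second line equals 1 + m w_T^t (1 - 2\varepsilon_t), which exceeds 1 whenever \varepsilon_t < 0.5, so a never-misclassified source weight still shrinks at every iteration. Requiring the weight to stay fixed, as the WMA prescribes, and using 1 - n w_S^t = m w_T^t gives the factor in the last line.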
Theorem 2. The dynamic factor applied to the source weights makes the target weights converge as outlined by TrAdaBoost.
Proof. In TrAdaBoost without any source instances, the weights of correctly classified target instances are updated by normalizing over the target weights alone. When the dynamic factor is applied to the update of the source instance weights, the update mechanism of the target instance weights in MSDTrA reduces to the same form.
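A hedged sketch of the computation behind Theorem 2, again in assumed notation (n source instances with weight w_S^t, m target instances with weight w_T^t, target error \varepsilon_t, dynamic factor C_t = 2(1 - \varepsilon_t), normalized weights): the first line is the update of a correctly classified target weight when no source instances are present, so that m w_T^t = 1; the second is the same update with source instances rescaled by C_t.

```latex
\begin{align*}
n = 0:\quad
w_T^{t+1} &= \frac{w_T^{t}}{m\, w_T^{t}(1 - \varepsilon_t) + m\, w_T^{t}\, \varepsilon_t\, \beta_t^{-1}}
           = \frac{w_T^{t}}{2 m\, w_T^{t}(1 - \varepsilon_t)}
           = \frac{w_T^{t}}{2(1 - \varepsilon_t)}, \\
\text{with } C_t:\quad
w_T^{t+1} &= \frac{w_T^{t}}{C_t\, n\, w_S^{t} + 2 m\, w_T^{t}(1 - \varepsilon_t)}
           = \frac{w_T^{t}}{2(1 - \varepsilon_t)\,(n\, w_S^{t} + m\, w_T^{t})}
           = \frac{w_T^{t}}{2(1 - \varepsilon_t)}.
\end{align*}
```

Both lines end in the same expression, so under these assumptions the presence of rescaled source instances does not change how the target weights evolve.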
Next, we analyze the performance of MSDTrA on the target training set.
Theorem 3 concerns the final error on the target training set.
Proof. Let the final sample set contain all samples of the target domain that are misclassified; the final error is defined on this set. At each iteration, the error on the target training set is computed. If this error is 0, the training sample weights are not updated; otherwise, the weights of the target training samples are updated according to the updating rule. In addition, a further criterion holds. Combining (18) and (19) gives (20), and substitution into (20) yields the result.
According to Theorem 3, because the algorithm requires the error at each iteration to be less than 0.5, the final error on the target training data decreases as the number of iterations increases. An upper bound on the associated generalization error can also be calculated in terms of the VC-dimension of the weak classifier model.
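To make the decrease with iterations concrete, the following Python sketch evaluates the classical AdaBoost training-error bound, the product over iterations of 2 * sqrt(eps_t * (1 - eps_t)). Using this classical bound is an assumption made for illustration: the exact expression of Theorem 3 may differ, for example in which iterations enter the product. The function name is hypothetical.

```python
import math

def adaboost_training_bound(errors):
    """Classical AdaBoost bound: product over t of 2 * sqrt(eps_t * (1 - eps_t)).

    Every per-iteration error must lie in [0, 0.5), in which case each
    factor is below 1 and the bound shrinks as iterations are added.
    """
    bound = 1.0
    for eps in errors:
        if not 0.0 <= eps < 0.5:
            raise ValueError("each error must lie in [0, 0.5)")
        bound *= 2.0 * math.sqrt(eps * (1.0 - eps))
    return bound
```

For instance, ten iterations with error 0.3 give a smaller bound than five iterations with the same error, which matches the qualitative statement above.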
4. Experimental Results and Analysis
The performance of the proposed method is investigated based on object category recognition in this section. Without loss of generality, we consider the following case: a small number of training samples of a target object category and a large number of training samples of other source object categories. For any test sample, we verify whether it belongs to the target object category or not.
4.1. Experimental Setting
For object category recognition, the Caltech 256 dataset, which contains 256 object categories, is considered. Of these 256 categories, the 80 that contain more than 50 samples are used in our experiment. We designate the target category and randomly draw the samples that form the target data. The number of training samples is limited to between 1 and 50, while the number of test samples is 50. Furthermore, to illustrate that the proposed method does not depend on the dataset, we also use the background dataset, collected via the Google image search engine, along with the remaining categories as our augmented background dataset to verify the effectiveness and robustness of the method.
The remaining categories are treated as the repository from which positive samples for the source data are drawn. The number of source categories (domains) is varied from 1 to 10 to investigate the performance of the classifiers with respect to the variability of domains. Each source contributes 100 samples. For each target object category, the performance of the classifier is evaluated over 20 random combinations of source object categories. Given the target and source categories, the performance of the classifier is obtained by averaging over 20 trials. The overall performance of the classifier is averaged over 20 target categories. An SVM is selected as the base classifier, and the number of iterations is 50.
4.2. Error Analysis
Because transfer learning is not needed to obtain good classification results when the target dataset is large, the standard cross-validation method is not used here. A small portion of the target set is used for training, and most of the remaining samples are used for testing. Figure 1 compares AdaBoost, TrAdaBoost, MSTrA, and MSDTrA in terms of the area under the receiver operating characteristic (ROC) curve for different numbers of target training samples and different numbers of source domains. The area bounded by the ROC curve and the horizontal axis is used to evaluate the performance of the different algorithms.
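As a reminder of how the area under the ROC curve can be computed from classifier scores, here is a minimal Python sketch based on the rank (Mann-Whitney) formulation: the area equals the fraction of positive-negative pairs in which the positive sample receives the higher score, with ties counted as one half. The function name is hypothetical, and this is not necessarily the procedure used to produce the figures.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    scores: classifier scores, higher meaning more likely positive.
    labels: 1 for the target category, 0 otherwise.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A perfect ranking yields 1.0 and a random ranking yields about 0.5, so larger values indicate better separation of the target category from the background.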
Figure 1: (a) The number of source domains. (b) The number of training instances.
With the number of source domains fixed, Figure 1(a) shows the ROC results of the four algorithms as the number of training instances increases. Since AdaBoost does not transfer any knowledge from the source, its performance depends mainly on the number of target training samples. For a very small number of target training samples, it shows only a slight improvement. By contrast, owing to its transfer learning mechanism, TrAdaBoost improves noticeably by combining the three sources. By incorporating the ability to transfer knowledge from multiple individual domains, MSTrA and MSDTrA demonstrate a significant improvement in recognition accuracy, even for a very small number of target training samples. In addition, the performance of AdaBoost and TrAdaBoost depends strongly on the selection of source domains and target positive samples, as the standard deviation of the area under the ROC curve shows.
With the number of training instances fixed, Figure 1(b) shows the ROC results of the four algorithms as the number of source domains increases. As the number of source domains increases, the area under the ROC curve of MSTrA and MSDTrA increases and the corresponding standard deviations decrease. This indicates improved performance in both accuracy and consistency. Since TrAdaBoost is incapable of exploring the decision boundaries separating multiple source domains, its performance remains unchanged regardless of the number of source domains.
Figure 2 compares the classification performance of different methods in the target domain. The AdaBoost algorithm does not transfer source domain knowledge and achieves lower classification accuracy. DTrAdaBoost has relatively poor test results because it uses only one source domain training sample set and gains the least useful knowledge from the source domain. CDASVM, which is based on a structural risk minimization model, fully considers the source domain sample information and thus has good classification accuracy. MSTrA and MSDTrA use training sets from four different source domains, which contain more useful information, so they achieve higher testing accuracy than the other algorithms. In each set of experiments, MSTrA selects only the classifier with the highest accuracy at each iteration and ignores the impact of the other source domains on the target task. By adding the dynamic factor and the weighting mechanism, MSDTrA makes better use of the useful knowledge in all source domains and eliminates the influence of unrelated samples in the source domain training sets on the target task, so it performs better than MSTrA.
To obtain objective comparison results, hypothesis testing is applied to the experimental results. Let five random variables denote the classification error rates of the MSDTrA, MSTrA, CDASVM, DTrAdaBoost, and AdaBoost algorithms, respectively. Since these error rates are subject to many random factors, we assume that they follow normal distributions. We then compare the means of these random variables. The smaller the mean, the lower the expected classification error rate and the higher the efficiency. Because the sample variance is an unbiased estimate of the population variance, the sample variance is used as the estimate of the population variance. In this experiment, the significance level is set to 0.01.
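The comparison described above can be sketched as a one-sided two-sample test in Python. This is a minimal illustration under assumptions: the exact test statistic is not specified in the text, so the sketch uses a normal approximation with sample variances plugged in, and the function name and any input data are hypothetical.

```python
import math
from statistics import NormalDist, mean, variance

def one_sided_mean_test(errors_a, errors_b, alpha=0.01):
    """Test H0: mean(a) >= mean(b) against H1: mean(a) < mean(b).

    errors_a, errors_b: error rates observed over repeated trials.
    Returns (z, reject_h0), where z is the standardized difference of means
    and the sample variances stand in for the population variances.
    """
    n_a, n_b = len(errors_a), len(errors_b)
    se = math.sqrt(variance(errors_a) / n_a + variance(errors_b) / n_b)
    z = (mean(errors_a) - mean(errors_b)) / se
    critical = NormalDist().inv_cdf(alpha)  # negative for alpha < 0.5
    return z, z < critical
```

Rejecting the null hypothesis at the 0.01 level supports the claim that algorithm a has a lower expected error rate than algorithm b. With few trials, a t-distribution would be the more conservative choice of reference distribution.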
4.3. Time Complexity
Since several domains are used together in the learning of the target task, the time complexity with multiple source domains is higher than that with a single domain. Given the time complexity of training a classifier and that of updating the weights, the time complexities of AdaBoost, DTrAdaBoost, MSTrA, and MSDTrA can be approximated in terms of these two quantities. Furthermore, Figure 3 shows the average training time of the four algorithms with the remaining parameters fixed.
Figure 3: (a) The number of source domains. (b) The number of training instances.
4.4. Dynamic Factor
This experiment examines the effect of the dynamic factor on the source weights and target weights. Here a single source domain is considered. In Figure 4(a), the numbers of source and target instances are held constant and the source error rate is set to zero. According to the WMA, the source weights should therefore not change. The ratio of the weights of MSDTrA to those of MSTrA is plotted at different boosting iterations for several target error rates.
Figure 4(a) shows the following. In MSTrA, the source weights always converge, even when the source instances are classified correctly. MSDTrA matches the behavior of the WMA. If the dynamic factor is not applied, the smaller the target error rate, the faster the source weights converge. In addition, for a weak learner, MSTrA is still unable to achieve good performance when using over 1000 source instances, even though they were never misclassified.
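The behavior described here can be reproduced with a small simulation. The sketch below is an illustration under assumptions, not the experiment itself: it tracks the normalized weight of a source instance that is never misclassified, assumes a constant target error rate in every round, and uses the factor 2 * (1 - eps) as the dynamic factor. The function name and the instance counts in the usage note are hypothetical.

```python
def source_weight_trajectory(n_source, n_target, eps, n_rounds, dynamic):
    """Normalized weight of one never-misclassified source instance per round.

    eps: constant target error rate in (0, 0.5).
    dynamic: if True, source weights are rescaled by 2 * (1 - eps) each round;
    if False, they are left untouched before normalization.
    """
    source_mass = n_source / (n_source + n_target)  # uniform initial weights
    history = [source_mass / n_source]
    c_t = 2.0 * (1.0 - eps) if dynamic else 1.0
    for _ in range(n_rounds):
        target_mass = 1.0 - source_mass
        new_source = c_t * source_mass
        # Correct target mass is kept and misclassified mass eps is multiplied
        # by (1 - eps) / eps, so the total target mass grows by 2 * (1 - eps).
        new_target = 2.0 * (1.0 - eps) * target_mass
        source_mass = new_source / (new_source + new_target)
        history.append(source_mass / n_source)
    return history
```

For example, with 1000 source instances, 100 target instances, and a target error of 0.1, the trajectory without the dynamic factor decays at every round, whereas the trajectory with the dynamic factor stays at its initial value, which is the WMA behavior for correctly classified instances.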
4.5. Rate of Convergence
The number of source instances is held fixed, and the classification error is allowed to vary over a range; Figure 4(b) shows the results after a single iteration for different numbers of target instances. After a single boosting iteration, the weight ratio of a correctly classified source instance increases as the number of target instances increases.
5. Conclusions
For the situation in which sample data from the source domains and the target domain have similar distributions, an instance transfer learning method based on multisource dynamic TrAdaBoost is presented. By integrating the knowledge in multiple source domains, this method makes good use of the information in all source domains to guide target task learning. Whenever candidate classifiers are trained, all samples in all source domains are involved in learning, and the information that is beneficial to target task learning can be obtained, so that negative transfer can be avoided. The theoretical analysis and experimental results suggest that the proposed algorithm has higher classification accuracy than several existing algorithms.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgment
This research is supported by the Fundamental Research Funds for the Central Universities (2013XK09).
References
H. Wang, Y. Gao, and X. G. Chen, “Transfer of reinforcement learning: the state of the art,” Acta Electronica Sinica, vol. 36, pp. 39–43, 2008.
Q. Zhang, M. Li, and Y. H. Chen, “Instance-based transfer learning method with multi-source dynamic TrAdaBoost,” Journal of China University of Mining and Technology, vol. 43, no. 4, pp. 701–708, 2014.
J. N. Meng, Research on the Application of Transfer Learning on Text Classification, Dalian University of Technology, 2011.
Y. Cheng, G. Cao, X. Wang, and J. Pan, “Weighted multi-source TrAdaBoost,” Chinese Journal of Electronics, vol. 22, no. 3, pp. 505–510, 2013.
D. S. Ben and R. Schuller, “Exploiting task relatedness for multiple task learning,” in Proceedings of the 16th Annual Conference on Learning Theory, pp. 567–580, Washington, DC, USA, 2008.
P. Wu and T. G. Dietterich, “Improving SVM accuracy by training on auxiliary data sources,” in Proceedings of the 21st International Conference on Machine Learning (ICML '04), pp. 871–878, July 2004.
X. Liao, Y. Xue, and L. Carin, “Logistic regression with an auxiliary data source,” in Proceedings of the 22nd International Conference on Machine Learning, pp. 505–512, ACM, August 2005.
M. T. Rosenstein, Z. Marx, L. P. Kaelbling et al., “To transfer or not to transfer,” in Proceedings of the Neural Information Processing Systems Workshop on Transfer Learning (NIPS '05), p. 898, 2005.
D. Pardoe and P. Stone, “Boosting for regression transfer,” in Proceedings of the 27th International Conference on Machine Learning (ICML '10), pp. 863–870, Haifa, Israel, June 2010.
E. Eaton, Selective knowledge transfer for machine learning [Ph.D. dissertation], University of Maryland, Baltimore, Md, USA, 2009.
A. S. Samir and K. R. Chandan, “Adaptive boosting for transfer learning using dynamic updates,” in Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 60–75, 2011.
Q. Zhang, M. Li, X. S. Wang et al., “Instance-based transfer learning for multi-source domains,” Acta Automatica Sinica, vol. 40, no. 6, pp. 1175–1182, 2014.