Complexity

Volume 2018, Article ID 8425821, 13 pages

https://doi.org/10.1155/2018/8425821

## Unsupervised Domain Adaptation Using Exemplar-SVMs with Adaptation Regularization

1. School of Computer and Control Engineering, University of Chinese Academy of Sciences, Beijing 100049, China
2. Research Center on Fictitious Economy and Data Science, Chinese Academy of Sciences, Beijing 100190, China
3. Key Laboratory of Big Data Mining and Knowledge Management, Chinese Academy of Sciences, Beijing 100190, China
4. School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China
5. School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China

Correspondence should be addressed to Yingjie Tian; tyj@ucas.edu.cn

Received 28 August 2017; Accepted 18 February 2018; Published 22 April 2018

Academic Editor: Shirui Pan

Copyright © 2018 Yiwei He et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Domain adaptation has recently attracted attention in visual recognition. It assumes that source and target domain data are drawn from the same feature space but different marginal distributions, and its motivation is to use source domain instances to help train a robust classifier for target domain tasks. Previous studies have mostly focused on reducing the distribution mismatch across domains. However, in many real-world applications there is also sample selection bias among the instances within a domain, which degrades the generalization performance of learners. To address this issue, we propose a novel model named Domain Adaptation Exemplar Support Vector Machines (DAESVMs), based on exemplar support vector machines (exemplar-SVMs). Our approach addresses sample selection bias and domain adaptation simultaneously. Compared with the usual domain adaptation problem, we go a step further in relaxing the i.i.d. assumption. First, we formulate DAESVMs to train classifiers while reducing the Maximum Mean Discrepancy (MMD) among domains by mapping data into a latent space and preserving the properties of the original data; then we integrate the classifiers to make predictions for target domain instances. Experiments conducted on the Office and Caltech10 datasets verify the effectiveness of the proposed model.

#### 1. Introduction

Over the past decades, machine learning technologies have achieved significant success in various areas, such as computer vision [1], natural language processing [2], and video detection [3]. However, traditional machine learning methods assume that training and testing data come from the same domain, meaning that they are drawn from the same distribution and represented in the same feature space. This assumption rarely holds in the real world, since collecting enough suitable labeled data is time consuming and requires expensive manual effort. Lacking labeled data, most traditional machine learning methods lose their generalization performance in practice. It is therefore desirable to use data from a related domain to help train a robust learner for the target domain. Driven by this requirement, transfer learning has developed rapidly in recent years [4]. Transfer learning relaxes the assumption of traditional machine learning that data and labels are drawn from the same distribution and represented in the same feature space. In transfer learning settings, domains are assumed only to be similar or related, possibly with weak or no direct relationship, in place of the i.i.d. assumption. Transfer learning is thus strongly motivated both when extending classical machine learning methods and when applying them to real-world problems, and it can be regarded as a complement to classical machine learning. One motivation is the problem of covariate shift or sample selection bias. Another is the desire to train a universal or general model as a predictor for all tasks, viewed as parameter or learner sharing; this is also considered a goal of Artificial General Intelligence. Transfer learning aims to use source or related domains to help target domain tasks. It has achieved significant success in various practical applications, such as face recognition [5], natural language processing [6], cross-language text classification [7], WiFi localization [8], and medical imaging [9].

Domain adaptation is a subproblem of transfer learning which assumes that source and target domain data are generated from the same feature and label spaces but different marginal probability distributions. It aims to solve problems in which there is little or no labeled data in the target domain, usually by using labeled data in the source domain to assist the training of target domain tasks. Many works focus on domain adaptation problems, and they also extend to applications such as WiFi localization, text sentiment analysis, and multidomain image classification. Since distribution mismatch generally exists in real-world applications, other research areas are also concerned with domain adaptation. For example, the extreme learning machine (ELM) is an efficient model for training single-hidden-layer networks [10], and several ELM works address the domain adaptation setting [11, 12]. Like most previous domain adaptation classifiers, they add a constraint term based on instance reweighting to minimize the Maximum Mean Discrepancy (MMD) [13]. However, these methods need to assume that the difference between the source and target domains is not too large; that is, the idea requires that the domains be similar.

Most pattern recognition problems can be transformed into several basic classification tasks. Generally speaking, classification tasks assume that a category can be represented by a hyperplane [14, 15], and most machine learning algorithms aim to learn hyperplanes to predict unseen instances. To improve the representational ability of a hyperplane, some works first cluster the samples and then solve classification tasks on the clusters. In contrast to category-level classification, a cluster-level classifier can capture more information about the positive category, but at a greater risk of overfitting. Motivated by object detection, [16] proposed an extreme classification model, named exemplar support vector machines (exemplar-SVMs), that trains a classifier for every positive instance against all the negative instances. In fact, exemplar-SVMs can be viewed as an extreme case of cluster-level SVMs in which every positive sample is regarded as a cluster. There are two viewpoints on why the exemplar-SVM achieves surprising generalization performance. One takes the exemplar-SVMs as a representation of positive instances with complete detail: every classifier captures details of its positive instance, such as background, corners, color, or orientation, and together the classifiers describe the category more intrinsically. From the transfer learning viewpoint, training data cannot satisfy the underlying i.i.d. assumption, as every instance in the training set may differ from the others, namely, sample selection bias [17]. Each exemplar-SVM classifier is trained on one highly weighted positive sample and the negative samples, so it can represent the positive sample well within the same distribution. Recently, [18] extended exemplar-SVMs to a transfer learning form that uses loss function reweighting and adds a low-rank regularization term for the classifiers.

In this work, we propose a novel model to address unsupervised domain adaptation problems, in which the target domain data are unlabeled; furthermore, it permits distribution mismatch among instances. In our model, we train a kernel exemplar classifier for every positive instance and then integrate the classifiers to make predictions for the target domain data. To align the distribution mismatch, we embed a regularization term based on TCA in our classifiers. In our view, the model constructs a bridge to transfer knowledge: we use the information in the kernel matrix, which encodes the instance representations in a high-dimensional space, to assist classifier training across domains. For the problem of sample selection bias, we integrate the classifiers to make predictions; essentially, the integration step expands the representation of the hyperplanes and takes full advantage of the details learned before.

Our contributions are as follows. We propose a novel unsupervised domain adaptation model based on exemplar-SVMs, named Domain Adaptation Exemplar Support Vector Machines (DAESVMs), which improves standard domain adaptation prediction accuracy by transferring knowledge across domains. Every DAESVM classifier constructs a bridge that transmits knowledge from the source domain to the target domain. Compared with traditional two-step methods, this strategy searches the optimization point of the model thoroughly, which makes the classification hyperplane more precise across domains. To address sample selection bias, we use ensemble methods to integrate the classifiers; the ensemble process is akin to relaxing the classification hyperplane, dropping unreliable classification results and using the reliable parts to make predictions. Inspired by [19], we bring pseudo labels into DAESVMs to supplement the information of the target domain, and our experiments verify their effectiveness. We go a step further and extend DAESVMs to multidomain adaptation. The rest of this paper is organized as follows. In Section 2, we introduce the notation of the problem and review related work on domain adaptation, exemplar-SVMs, and Transfer Component Analysis (TCA). In Section 3, we derive and formulate the DAESVM model. In Section 4, we propose the optimization algorithm for our model. In Section 5, we integrate all the DAESVM classifiers to make predictions. In Section 6, we analyze experiments on several transfer learning datasets to verify the effectiveness of DAESVMs. In Section 7, we conclude our work and outline future directions.

#### 2. Notation and Related Works

This section introduces the notation used in this paper and reviews related work.

##### 2.1. Notation

In this paper, we follow the transfer learning definitions of [4], considering the case of one source domain and one target domain. First, we define *domain* and *task*. A domain is composed of a feature space $\mathcal{X}$ and a marginal probability distribution $P(X)$, namely, $\mathcal{D} = \{\mathcal{X}, P(X)\}$. A task is composed of a label space $\mathcal{Y}$ and a prediction model $f(\cdot)$, namely, $\mathcal{T} = \{\mathcal{Y}, f(\cdot)\}$. From the view of probability, $f(x)$ can be written as $P(y \mid x)$. The notations frequently used in this paper are summarized in the Notations and Descriptions section. The definition of transfer learning is as follows: given a labeled source domain $\mathcal{D}_s = \{(x_i^s, y_i^s)\}_{i=1}^{n_s}$ with a source task $\mathcal{T}_s$, and an unlabeled target domain $\mathcal{D}_t = \{x_j^t\}_{j=1}^{n_t}$ with a target task $\mathcal{T}_t$, transfer learning aims to utilize $\mathcal{D}_s$ and $\mathcal{T}_s$ to help train a robust prediction model for the target domain under the condition that $\mathcal{D}_s \neq \mathcal{D}_t$ or $\mathcal{T}_s \neq \mathcal{T}_t$.

##### 2.2. Domain Adaptation

As a subproblem of transfer learning, domain adaptation has achieved great success and is used in many applications. It assumes that the source and target domain data share the same feature space, label space, and prediction function; from the view of probability, the marginal distributions differ while the conditional distributions are equal, namely, $P(X_s) \neq P(X_t)$ while $P(Y_s \mid X_s) = P(Y_t \mid X_t)$. It is generally agreed that domain adaptation approaches can be divided into three categories: reweighting approaches, feature transfer approaches, and parameter sharing approaches.

*(1) Reweighting Approaches.* In transfer learning tasks, the basic idea of using the source data to help train a target predictor is to reduce the discrepancy between the source and target data as far as possible. Under the assumption that the source and target domains have many overlapping features, a conventional method is to reweight or select source domain instances to correct the marginal probability distribution mismatch. Based on the distance metric between distributions named Maximum Mean Discrepancy (MMD), [20] proposed a technique called Kernel Mean Matching (KMM) that revises the weight of every instance to minimize the MMD between the source and target domains. Similar to KMM, [21] used the same idea but a different metric to adjust the discrepancy of domains. Reference [22] used the AdaBoost strategy to update the weights of the source domain data, raising the weights of instances in favor of the classification task; it also derived generalization error bounds for the model based on PAC learning theory. More recently, [23] used a two-step approach: first sample the instances that are similar to the other domain as landmarks, and then use these landmarks to map the data into a high-dimensional space in which the domains overlap more. Reference [24] solved the same problem but relaxed the similarity assumption, assuming no direct relationship between the source and target domains; the model, named Selective Transfer Machine (STM), reweights instances of personal faces to train a generic classifier. Most instance-based transfer learning techniques use KMM to measure the difference between distributions, and these methods have been applied in many areas, such as facial action unit detection [25] and prostate cancer mapping [26].
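The reweighting approaches above all hinge on estimating the MMD between two samples. As a minimal illustration (our own NumPy sketch, not code from any of the cited methods), the biased empirical estimate of the squared MMD can be computed from three Gram matrices; with the default linear kernel it reduces to the squared distance between the sample means:

```python
import numpy as np

def mmd2(Xs, Xt, kernel=lambda A, B: A @ B.T):
    """Biased empirical estimate of the squared Maximum Mean Discrepancy.

    Averages the within-sample and cross-sample Gram matrices; with a
    linear kernel this equals the squared distance between sample means.
    """
    Kss = kernel(Xs, Xs)
    Ktt = kernel(Xt, Xt)
    Kst = kernel(Xs, Xt)
    return Kss.mean() + Ktt.mean() - 2.0 * Kst.mean()

rng = np.random.default_rng(0)
Xs = rng.normal(0.0, 1.0, size=(200, 5))   # source sample
Xt = rng.normal(0.5, 1.0, size=(200, 5))   # mean-shifted target sample
print(mmd2(Xs, Xs))  # exactly zero: identical samples
print(mmd2(Xs, Xt))  # positive: distribution mismatch
```

Instance reweighting methods such as KMM then choose per-instance weights on the source sample so that this quantity is minimized.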

*(2) Feature Transfer Approaches.* Compared with instance-based approaches, feature-based approaches relax the similarity assumption. They assume that the source and target domains share some features, called shared features, while each domain also has its own features, called specific features [27]. For example, consider using movie reviews to help a sentiment classification task on sofa reviews: the word “comfortable” is often nonzero in the sofa domain features but almost always zero in the movie domain features, so it is a specific feature of the sofa domain. Feature transfer approaches aim to find a shared latent subspace in which the distance between the source and target domains is minimized. Reference [28] proposed an unsupervised domain adaptation approach named Geodesic Flow Kernel (GFK) based on the kernel method: GFK maps data onto Grassmann manifolds and constructs geodesic flows to reduce the mismatch among domains, effectively exploiting intrinsic low-dimensional structures of the data. To solve cross-domain natural language processing (NLP) problems, [29] proposed a general method, structural correspondence learning (SCL), which learns a discriminative predictor by identifying correspondences among features across domains; SCL first finds pivot features and then links the shared features with each other. Reference [7] learned a predictor by mapping the target kernel matrix to a submatrix of the source kernel matrix. Deep neural networks are used not only for learning essential features but also for domain adaptation: [30] proposed a neural network architecture for domain adaptation named Deep Adaptation Network (DAN) and extended it to joint adaptation networks (JAN) [31], and [32] discussed the transferability of domain features in deep neural networks.

*(3) Parameter-Based Approaches.* The core idea of parameter-based approaches is to transfer parameters from source to target domain tasks, assuming that different domains share some parameters that can be reused across domains. Reference [33] proposed the Adaptive Support Vector Machine (A-SVM) as a general method to adapt to new domains: A-SVM first trains an auxiliary classifier and then learns the target predictor based on the original parameters. Reference [34] reweighted the predictions of the source classifier on the target domain according to the distance between domains.

##### 2.3. Exemplar Support Vector Machines

The exemplar-SVM was proposed in [16] for object detection, where it achieved high performance. It trains a classifier for every positive instance against all the negative instances; every positive instance is an exemplar, and its corresponding classifier can be viewed as a representation of that instance. At prediction time, every classifier outputs a value for the test instance, a calibration function converts the value to a score, and the class of the highest-scoring classifiers is taken as the prediction. Exemplar-SVMs address the problem that a single hyperplane can hardly represent a whole category, and they adopt an extreme strategy to train the predictor. In [35], the training procedures are gathered into one model with nuclear norm regularization for the setting of domain generalization, which assumes the target domain is unseen; the model is also extended to domain generalization and multiview problems [36, 37]. In [38], the two hyperparameters are reduced to one and exemplar-SVMs are extended to a kernel form.

##### 2.4. Transfer Component Analysis

Reference [39] proposed a dimension reduction method called maximum mean discrepancy embedding (MMDE). By minimizing the distance between the source and target data distributions in a shared latent space, the source domain data are used to assist training a classifier for the target domain. MMDE not only minimizes the distance between the domains in the latent space but also preserves the properties of the data by maximizing the data variance. Based on MMDE, [40] extended it to handle unseen instances and reduced its computational complexity. Essentially, TCA replaces learning the kernel matrix with a transformation of an initial kernel matrix, and the optimization problem reduces to finding the leading eigenvectors of an objective matrix.

#### 3. Domain Adaptation Exemplar Support Vector Machine

In this section, we present the formulation of the Domain Adaptation Exemplar Support Vector Machine (DAESVM). In the remainder of this paper, we use a lowercase letter in boldface to represent a column vector and an uppercase letter in boldface to represent a matrix, extending the notation introduced in Section 2. We use $x_i^+$, $i = 1, \dots, n_+$, where $n_+$ is the number of positive instances, to represent a positive instance, and $x_j^-$, $j = 1, \dots, n_-$, where $n_-$ is the number of negative instances, to represent a negative instance. The set of negative samples is written as $\mathcal{N}$. This section introduces the formulation of a single exemplar classifier; in fact, we need to train as many exemplar classifiers as there are source domain instances, and the method that integrates these classifiers is proposed in Section 5.

##### 3.1. Exemplar-SVM

The exemplar-SVM is built on the extreme idea of training one classifier per positive instance against all the negative instances and then calibrating the classifier outputs into a probability distribution to separate the samples; the model therefore trains as many classifiers as there are positive instances. Learning a classifier that separates a single positive instance $x^+$ from all the negative instances can be modeled as

$$\min_{\mathbf{w}, b} \ \frac{1}{2}\|\mathbf{w}\|^2 + C_1\, h(\mathbf{w}^{\top} x^+ + b) + C_2 \sum_{x_j^- \in \mathcal{N}} h(-\mathbf{w}^{\top} x_j^- - b), \tag{1}$$

where $\|\cdot\|$ is the 2-norm of a vector and $C_1$ and $C_2$ are tradeoff parameters, corresponding to $C$ in the standard SVM, that balance the positive and negative error costs. $h(x) = \max(0, 1 - x)$ is the hinge loss function.

Formulation (1) is the primal problem of the exemplar-SVM, and we can derive its dual problem in order to use kernel methods. The dual formulation can be written as follows [38]:

$$\min_{\boldsymbol{\alpha}} \ \frac{1}{2}\boldsymbol{\alpha}^{\top} Q \boldsymbol{\alpha} - \mathbf{e}^{\top}\boldsymbol{\alpha} \quad \text{s.t.} \ \ 0 \le \alpha_i \le C_i, \ \ \mathbf{y}^{\top}\boldsymbol{\alpha} = 0, \tag{2}$$

where $C_i = C_1$ if $x_i$ is the positive exemplar and $C_i = C_2$ otherwise. The $\alpha_i$ are Lagrangian multipliers, and $\mathbf{e}$ is the vector of all ones. We take this model as an exemplar learner. The matrix $Q$ is composed of

$$Q_{ij} = y_i y_j K(x_i, x_j). \tag{3}$$
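To make the exemplar-learner construction concrete, here is an illustrative sketch (our own, with hypothetical parameter names, not the paper's solver) that trains one linear SVM per positive exemplar against a shared negative pool, using scikit-learn's `LinearSVC` with class weights playing the role of the tradeoff parameters $C_1$ and $C_2$:

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_exemplar_svms(X_pos, X_neg, C_pos=10.0, C_neg=0.1):
    """Train one linear SVM per positive exemplar vs. all negatives.

    C_pos / C_neg mirror the two tradeoff parameters that weight the
    positive and negative hinge-loss terms in the primal problem (1).
    """
    classifiers = []
    for exemplar in X_pos:
        X = np.vstack([exemplar[None, :], X_neg])  # one positive + pool
        y = np.array([1] + [-1] * len(X_neg))
        clf = LinearSVC(C=1.0, class_weight={1: C_pos, -1: C_neg},
                        max_iter=10000)
        clf.fit(X, y)
        classifiers.append(clf)
    return classifiers

rng = np.random.default_rng(0)
X_pos = rng.normal(2.0, 0.5, size=(5, 4))    # positive exemplars
X_neg = rng.normal(-2.0, 0.5, size=(50, 4))  # shared negative pool
svms = train_exemplar_svms(X_pos, X_neg)
print(len(svms))  # one classifier per positive exemplar: 5
```

Each returned classifier is a separate representation of its exemplar; at test time their scores would be calibrated and compared, as described in Section 2.3.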

##### 3.2. Pseudo Label for Kernel Matrix

To make the best use of the samples in both the source and target domains, we construct the kernel matrix on the data of both domains. However, in the dual problem of the SVM, the kernel matrix must be combined with label information, while our model addresses the unsupervised domain adaptation problem in which only the source domain data are labeled. Motivated by [19], we use pseudo labels to help train the model. Pseudo labels are predicted by a classical classifier, an SVM in our model, trained on the labeled source data. Due to the distribution mismatch between the source and target domains, many of these labels may be incorrect. Following [19], we assume that the class centroids computed from the pseudo labels do not lie far from the true class centroids. Thus, we use the data of both domains to supplement the kernel matrix with label information. Our experiments verify that this method is effective.
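The pseudo-labeling step can be sketched as follows (an illustration on our own toy data, not the paper's experimental setup): train a standard SVM on the labeled source data and use its predictions on the unlabeled target data as pseudo labels:

```python
import numpy as np
from sklearn.svm import SVC

def pseudo_labels(Xs, ys, Xt):
    """Predict pseudo labels for unlabeled target data with a
    classifier trained on the labeled source data only."""
    base = SVC(kernel="rbf", gamma="scale")
    base.fit(Xs, ys)
    return base.predict(Xt)

rng = np.random.default_rng(0)
# two well-separated source classes
Xs = np.vstack([rng.normal(-2, 1, (40, 3)), rng.normal(2, 1, (40, 3))])
ys = np.array([0] * 40 + [1] * 40)
# target classes with a shifted marginal distribution
Xt = np.vstack([rng.normal(-1.5, 1, (30, 3)), rng.normal(2.5, 1, (30, 3))])
yt_pseudo = pseudo_labels(Xs, ys, Xt)
```

Because of the domain shift some pseudo labels will be wrong, but, per the centroid assumption above, they still carry enough label information to fill the target block of the kernel matrix.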

##### 3.3. Exemplar Learner in Domain Adaptation Form

In fact, each exemplar learner is an SVM in kernel form trained on one positive instance and all the negative instances. In the view of [16], a discriminative exemplar classifier can be taken as a representation of a positive instance. In tasks such as object detection or image classification, this parametric representation is valuable because samples have characteristics, such as angle, color, orientation, and background, that are hard to represent explicitly; an instance-based parametric discriminative classifier can capture more information about the positive sample. Similarly, following the motivation of transfer learning, we can view each positive instance as a domain, with some mismatch among domains. Our model aims to correct this mismatch and reduce the distance to the target domain. Following MMD, we construct a distance metric between domains for an exemplar learner, written as

$$\operatorname{dist}(\mathcal{D}_s, \mathcal{D}_t) = \left\| \frac{1}{n_s} \sum_{i=1}^{n_s} x_i^s - \frac{1}{n_t} \sum_{j=1}^{n_t} x_j^t \right\|^2. \tag{4}$$

However, (4) is only a metric; our requirement is to minimize this distance by some transformation. Motivated by Transfer Component Analysis (TCA), we want to map the instances into a latent space in which the source and target instances are more similar, and we denote this mapping by $\phi$. Namely, we aim to minimize the MMD distance between domains after mapping the instances into another space. We extend the distance function as follows:

$$\operatorname{dist}(\mathcal{D}_s, \mathcal{D}_t) = \left\| \frac{1}{n_s} \sum_{i=1}^{n_s} \phi(x_i^s) - \frac{1}{n_t} \sum_{j=1}^{n_t} \phi(x_j^t) \right\|^2. \tag{5}$$

As is conventional, (5) can be reformulated in kernel matrix form. We define the Gram matrices on the source positive domain, source negative domain, and target domain; the kernel matrix $K$ is composed of nine submatrices,

$$K = \begin{bmatrix} K_{pp} & K_{pn} & K_{pt} \\ K_{np} & K_{nn} & K_{nt} \\ K_{tp} & K_{tn} & K_{tt} \end{bmatrix} \in \mathbb{R}^{(n_s + n_t) \times (n_s + n_t)}, \tag{6}$$

where $K_{ij} = \phi(x_i)^{\top} \phi(x_j)$, and we construct the coefficient matrix $L$,

$$L_{ij} = \begin{cases} \dfrac{1}{n_s^2}, & x_i, x_j \in \mathcal{D}_s, \\[2mm] \dfrac{1}{n_t^2}, & x_i, x_j \in \mathcal{D}_t, \\[2mm] -\dfrac{1}{n_s n_t}, & \text{otherwise}. \end{cases} \tag{7}$$

Thus, the primal distance function is represented by $\operatorname{tr}(KL)$. Motivated by TCA [40], mapping the primal data is equivalent to transforming the kernel matrix generated by the source and target domain data. A low-dimensional transform matrix $W \in \mathbb{R}^{(n_s + n_t) \times m}$ reduces the dimension of the primal kernel matrix: it maps the empirical kernel map into an $m$-dimensional shared space, and the mapped kernel matrix is $\widetilde{K} = K W W^{\top} K$. Mostly, we replace the distance function by $\operatorname{tr}(\widetilde{K} L)$. In our case, we follow [40] and minimize the trace of the distance,

$$\min_{W} \ \operatorname{tr}(W^{\top} K L K W). \tag{8}$$

To control the complexity of $W$ and preserve the data characteristics, we add a regularization term and a constraint. The domain adaptation term follows TCA and is written as

$$\min_{W} \ \operatorname{tr}(W^{\top} K L K W) + \mu \operatorname{tr}(W^{\top} W) \quad \text{s.t.} \ \ W^{\top} K H K W = I, \tag{9}$$

where $\mu$ is a tradeoff parameter, $I$ is the identity matrix, and $H = I - \frac{1}{n_s + n_t}\mathbf{1}\mathbf{1}^{\top}$ is the centering matrix.
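As in TCA, the transformation above reduces to a generalized eigenproblem: the columns of $W$ are the leading eigenvectors of $(KLK + \mu I)^{-1} KHK$. The following is a schematic implementation with a linear kernel (our own sketch; function and variable names are illustrative, and the source domains are merged into one block for simplicity):

```python
import numpy as np
from scipy.linalg import eigh

def tca(Xs, Xt, mu=1.0, m=2):
    """TCA-style embedding sketch with a linear kernel.

    Finds W minimizing tr(W^T K L K W) + mu * tr(W^T W)
    subject to W^T K H K W = I, then returns the m-dimensional
    embeddings K @ W of the source and target data.
    """
    X = np.vstack([Xs, Xt])
    ns, nt = len(Xs), len(Xt)
    n = ns + nt
    K = X @ X.T                                  # linear Gram matrix
    # MMD coefficient matrix L (rank one: outer product of +/- weights)
    e = np.concatenate([np.full(ns, 1.0 / ns), np.full(nt, -1.0 / nt)])
    L = np.outer(e, e)
    # centering matrix H
    H = np.eye(n) - np.full((n, n), 1.0 / n)
    # generalized eigenproblem: (K H K) w = lambda (K L K + mu I) w
    vals, vecs = eigh(K @ H @ K, K @ L @ K + mu * np.eye(n))
    W = vecs[:, np.argsort(vals)[::-1][:m]]      # top-m components
    Z = K @ W                                    # embedded data
    return Z[:ns], Z[ns:]

rng = np.random.default_rng(0)
Zs, Zt = tca(rng.normal(0, 1, (30, 5)), rng.normal(1, 1, (40, 5)))
print(Zs.shape, Zt.shape)  # (30, 2) (40, 2)
```

In the full DAESVM model this eigenvector computation would be one of the two alternating subproblems described in Section 4.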

Furthermore, as in the dual SVM, the objective function needs to incorporate the training label information, and the same holds in our model. Thus, we construct the training label vector

$$\mathbf{y} = \begin{bmatrix} y^+ \\ \mathbf{y}^- \\ \hat{\mathbf{y}}^t \end{bmatrix}, \tag{10}$$

where $y^+ = +1$ is the label of the positive instance, $\mathbf{y}^-$ is the label vector of the negative source instances, and $\hat{\mathbf{y}}^t$ is the vector of pseudo labels of the target instances, predicted beforehand by the SVM. It can be rewritten in another form:

$$Y = \operatorname{diag}(\mathbf{y}). \tag{11}$$

The label matrix $Y$ provides the label information of the source domain data and the pseudo labels of the target domain. The matrix $Q$ in the dual problem (2) of the exemplar-SVM is built from the primal data kernel matrix. We want to replace it by mapping the kernel matrix into the latent subspace, namely, replace $K$ by $\widetilde{K} = K W W^{\top} K$, so that $Q = Y \widetilde{K} Y$. The final objective function of each DAESVM model is formulated as follows:

$$\min_{W, \boldsymbol{\alpha}} \ \frac{1}{2}\boldsymbol{\alpha}^{\top} Y \left( K W W^{\top} K \right) Y \boldsymbol{\alpha} - \mathbf{e}^{\top}\boldsymbol{\alpha} + \lambda \left( \operatorname{tr}(W^{\top} K L K W) + \mu \operatorname{tr}(W^{\top} W) \right) \quad \text{s.t.} \ \ 0 \le \alpha_i \le C_i, \ \ \mathbf{y}^{\top}\boldsymbol{\alpha} = 0, \ \ W^{\top} K H K W = I, \tag{12}$$

where $\lambda$ is a tradeoff parameter between the classification loss and the domain adaptation term.

#### 4. Optimization Algorithm

To minimize problem (12), we adopt an alternating optimization method that alternates between two subproblems: solving for the parameter $\boldsymbol{\alpha}$ and for the mapping matrix $W$. Since each subproblem is solved with the other variable fixed, the alternating approach is guaranteed not to increase the objective function. Algorithm 1 summarizes the optimization procedure for problem (12).
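The alternating scheme can be sketched generically (an illustration of the optimization pattern only, not the paper's Algorithm 1; the toy objective and solver names are our own). With exact block minimizers, the objective value never increases:

```python
def alternating_minimize(init_a, init_w, solve_a, solve_w, objective,
                         max_iter=50, tol=1e-6):
    """Generic alternating optimization: fix one block, solve the other.

    Each exact subproblem solve cannot increase the joint objective,
    so the sequence of objective values is monotonically non-increasing.
    """
    a, w = init_a, init_w
    prev = objective(a, w)
    for _ in range(max_iter):
        a = solve_a(w)          # subproblem 1: best a for fixed w
        w = solve_w(a)          # subproblem 2: best w for fixed a
        cur = objective(a, w)
        if prev - cur < tol:    # stop when progress stalls
            break
        prev = cur
    return a, w

# toy demo: minimize f(a, w) = (a - w)^2 + 0.1 * (w - 3)^2
f = lambda a, w: (a - w) ** 2 + 0.1 * (w - 3) ** 2
best_a = lambda w: w                      # exact argmin over a
best_w = lambda a: (a + 0.3) / 1.1        # exact argmin over w
a, w = alternating_minimize(0.0, 0.0, best_a, best_w, f)
print(round(a, 2), round(w, 2))  # both approach the minimizer at 3
```

In DAESVM the two block solves would be, respectively, a standard SVM dual QP for $\boldsymbol{\alpha}$ with $W$ fixed and a generalized eigenproblem for $W$ with $\boldsymbol{\alpha}$ fixed.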