Abstract

Extreme Learning Machine (ELM) is a fast and efficient neural network model for pattern recognition and machine learning, but its performance declines when labeled training samples are insufficient. Transfer learning helps the target task learn a reliable model by using plentiful labeled samples from a different but related domain. In this paper, we propose a supervised Extreme Learning Machine with knowledge transferability, called Transfer Extreme Learning Machine with Output Weight Alignment (TELM-OWA). Firstly, it reduces the distribution difference between domains by aligning the output weight matrices of the ELMs trained with labeled samples from the source and target domains. Secondly, the approximation between the interdomain ELM output weight matrices is added to the objective function to further realize the cross-domain transfer of knowledge. Thirdly, we formulate the objective function as a least squares problem and transform it into a standard ELM model that can be efficiently solved. Finally, the effectiveness of the proposed algorithm is verified by classification experiments on 16 image datasets and 6 text datasets, and the results demonstrate the competitive performance of our method with respect to other ELM models and transfer learning approaches.

1. Introduction

Neural networks have been widely researched for solving classification problems in recent years [1, 2] owing to their powerful nonlinear fitting and approximation capabilities. Extreme Learning Machine (ELM), a Single-Layer Feedforward Network (SLFN), has been proven to be an effective and efficient algorithm for pattern classification and regression [3, 4]. It randomly generates the input weights and biases of the hidden layer without tuning and only updates the weights between the hidden layer and the output layer. With regularized least squares (ridge regression) used as the prediction error criterion, the output weight can be efficiently obtained in closed form by the Moore–Penrose generalized inverse [3]. As a result, ELM has the advantages of strong generalization ability and fast training speed and has been widely used in various applications, such as face recognition [5], brain-computer interfaces [6–9], hyperspectral image classification [10], and malware hunting [11].

Although the learning speed and generalization ability of ELM are of great significance, it still has several shortcomings, and many algorithms have been put forward to improve it in both theory and applications. Because ELM can be highly affected by the random selection of the input weights and biases of the SLFN, Eshtay et al. [12] proposed a new model that uses a Competitive Swarm Optimizer (CSO) to optimize the values of the input weights and hidden neurons of ELM. For imbalanced data classification, Raghuwanshi and Shukla [13] presented a novel SMOTE-based Class-Specific Extreme Learning Machine (SMOTE-CSELM), a variant of the Class-Specific Extreme Learning Machine (CS-ELM), which exploits the benefits of both minority oversampling and class-specific regularization and has more favorable computational complexity than the Weighted Extreme Learning Machine (WELM) [14]. In order to reduce storage space and test time, the Sparse Extreme Learning Machine (Sparse ELM) [15] and the multilayer sparse Extreme Learning Machine [16] were proposed for classification. To overcome the bias problem of a single Extreme Learning Machine, the Voting-based Extreme Learning Machine (V-ELM) [17, 18] and AdaBoost Extreme Learning Machine [19–21] were proposed to reduce the risk of selecting the wrong model by aggregating all candidate models. Moreover, semisupervised ELM [22–25] and unsupervised ELM [26–28] algorithms were designed to utilize the large number of available unlabeled samples for improving classification and clustering performance. However, the above models rest on the typical assumption that training and testing data are sampled from an identical distribution [29], which does not always hold in the real world. The performance of ELM therefore degrades when sufficient identically distributed training samples are lacking, and labeling new samples is expensive and costly [30].

Domain adaptation [31–33], as an important branch of transfer learning, solves the above problems with the help of knowledge from a source domain that is different from but related to the target domain, and it resolves the inconsistency of sample distributions between the source and target domains. Zhang and Zhang [34] extended ELM to handle domain adaptation problems with very few labeled guide samples in the target domain and to overcome the generalization disadvantages of ELM in multidomain applications. Li et al. [35] proposed TL-ELM (transfer-learning-based ELM), which uses a small number of labeled target samples and a large number of labeled source samples to construct a high-quality classifier. Motivated by the biological learning mechanism, an Adaptive ELM (AELM) algorithm [36] was put forward for transfer learning, which introduces a manifold regularization term into ELM for image classification on deep convolutional feature representations. AELM is semisupervised transfer learning because it requires labels in the target domain; since collecting labels is difficult, unsupervised methods are often more desirable. Chen et al. [37] presented a transfer ELM framework that bridges the source domain parameters and the target domain parameters by a projection matrix, in which informative source domain features are selected for knowledge transfer and the L2,1-norm is applied to the source parameters. Li [38] and Chen [39], respectively, proposed two unsupervised domain adaptation Extreme Learning Machines by minimizing the classification loss and applying the Maximum Mean Discrepancy (MMD) strategy to the prediction results. Among the above approaches, supervised ELM for transfer learning is superior to unsupervised ones because it efficiently utilizes target labels.

In this paper, we focus on supervised transfer learning and propose a supervised ELM model with the ability of knowledge transfer, called Transfer Extreme Learning Machine with Output Weight Alignment (TELM-OWA), in which a small number of labeled target samples and a large number of labeled source samples are used to build a high-quality classification model. Firstly, it builds two ELM models utilizing the labeled source and target samples. Secondly, we use a mapping that transforms the output weight of the source ELM into that of the target ELM to align the distributions between domains. Thirdly, a regularization constraint for the approximation between the interdomain ELM output weight matrices is added to the objective function to improve the cross-domain transfer of knowledge. Finally, we transform the objective function into a standard ELM form for solving and classification. Our approach is illustrated in Figure 1. Extensive experiments conducted on 16 image datasets and 6 text datasets demonstrate significant advantages of our method over traditional ELM and state-of-the-art transfer learning methods.

The main contributions of this paper are as follows: (1) An idea of subspace alignment is adopted to reduce the distribution discrepancy between domains. (2) We apply the approximation constraint between the interdomain ELM output weight matrices to realize the efficient transfer of knowledge across domains. (3) The objective function is solved in standard ELM form, which is efficient and easy to understand. (4) We evaluate the proposed method with classification experiments on object recognition and text datasets, and the results verify its effectiveness and advantages.

The remainder of this paper is organized as follows: in Section 2, we briefly introduce domain adaptation and ELM. In Section 3, we present TELM-OWA. In Section 4, the experiments and analysis verifying the validity of TELM-OWA are presented. Finally, Section 5 concludes the paper.

2.1. Domain Adaptation

Transfer learning aims to learn a classifier for the target domain by leveraging knowledge from one or multiple well-labeled source domains. However, if the data distributions of the source and target domains differ greatly, its performance will suffer. In transfer learning, domain adaptation accelerates the cross-domain transfer of knowledge by minimizing the discrepancy between domains. According to how the interdomain distribution mismatch is corrected, domain adaptation can be roughly divided into three categories: sample weighting, subspace and manifold alignment, and statistical distribution alignment [33].

Sample weighting methods weigh each sample from the source domain to better match the target domain distribution and minimize the distribution divergence between the two domains [40, 41]; estimating the weights of the source samples is the key to this technique. The most classic sample-based transfer algorithm is TrAdaBoost, proposed by Dai et al. [42], which extends the AdaBoost algorithm and applies boosting technology to weigh the source and target samples. Many algorithms have been put forward to extend TrAdaBoost, such as DTrAdaBoost [43], Multisource-TrAdaBoost (MTrA) and Task-TrAdaBoost (TTrA) [44], and Multi-Source Tri-Training Transfer Learning (MST3L) [45].

Subspace and manifold alignment methods try to align the subspace or manifold representations to preserve important properties of the data while reducing the distribution discrepancy across domains. Subspace alignment (SA) [46–48] first projects the source and target samples into subspaces, respectively, and then learns a linear mapping to align the source subspace with the target one, reducing the cross-domain distribution difference for knowledge transfer.

Statistical distribution adaptation methods aim to explicitly evaluate and minimize the divergence of statistical distributions between the source and target domains so as to reduce the difference in the marginal distribution, the conditional distribution, or both. To this end, many statistical distances, such as Maximum Mean Discrepancy (MMD) [49], Bregman divergence [50], and KL divergence [51], have been proposed for domain adaptation. Transfer Component Analysis (TCA) [52], Joint Distribution Adaptation (JDA) [53], Weighted Maximum Mean Discrepancy (WMMD) [54], Transfer Subspace Learning (TSL) [55], and so forth were proposed to simultaneously tackle feature mapping, adaptation, and classification.
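As a small illustration (a sketch, not taken from this paper), the empirical MMD with a linear kernel reduces to the distance between the source and target feature means:

```python
import numpy as np

def linear_mmd(Xs, Xt):
    """Squared empirical MMD with a linear kernel.

    Xs: (n_s, d) source features; Xt: (n_t, d) target features.
    With a linear kernel, the MMD is simply the squared Euclidean
    distance between the two domain means.
    """
    diff = Xs.mean(axis=0) - Xt.mean(axis=0)
    return float(diff @ diff)
```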

2.2. Extreme Learning Machine (ELM)

ELM is a fast learning algorithm for single-hidden-layer neural networks. Compared with traditional neural network learning, it has two characteristics: (1) the hidden layer parameters (i.e., the input weights and biases) are randomly initialized and need no tuning; (2) the output layer weight is solved as a least squares problem. As a result, ELM achieves a faster learning speed and better generalization performance than traditional learning algorithms while guaranteeing high accuracy.

Suppose we are given a training dataset $\{(\mathbf{x}_i, \mathbf{t}_i)\}_{i=1}^{N}$ with $N$ samples, where $\mathbf{t}_i \in \mathbb{R}^{m}$ is the label vector corresponding to $\mathbf{x}_i$ and $m$ is the number of categories. The structure of the ELM is shown in Figure 2.

In Figure 2, $\mathbf{x}$ is the input sample, $\mathbf{W}$ is the input layer weight, $\mathbf{b}$ is the hidden layer bias, $g(\cdot)$ is the nonlinear activation function, $L$ is the number of nodes in the hidden layer, and $\boldsymbol{\beta}$ is the hidden layer output weight. The goal of ELM is to solve for the optimal output weight $\boldsymbol{\beta}$ by minimizing the sum of the squared prediction errors. The objective function is as follows:

$$\min_{\boldsymbol{\beta},\,\boldsymbol{\xi}_i}\ \frac{1}{2}\|\boldsymbol{\beta}\|^{2} + \frac{C}{2}\sum_{i=1}^{N}\|\boldsymbol{\xi}_i\|^{2} \quad \text{s.t.}\ \mathbf{h}(\mathbf{x}_i)\boldsymbol{\beta} = \mathbf{t}_i^{T} - \boldsymbol{\xi}_i^{T},\ i = 1, \ldots, N. \qquad (1)$$

In the previous equation, the first term is a regularization term to prevent model overfitting, $\boldsymbol{\xi}_i$ is the error vector corresponding to the $i$-th sample, and $C$ is the tradeoff coefficient between the training error and the regularization term.

Adding the constraints to the objective function yields

$$\min_{\boldsymbol{\beta}}\ \frac{1}{2}\|\boldsymbol{\beta}\|_{F}^{2} + \frac{C}{2}\|\mathbf{T} - \mathbf{H}\boldsymbol{\beta}\|_{F}^{2}, \qquad (2)$$

where $\mathbf{H} = [\mathbf{h}(\mathbf{x}_1)^{T}, \ldots, \mathbf{h}(\mathbf{x}_N)^{T}]^{T}$ is the hidden layer output matrix, $\mathbf{T} = [\mathbf{t}_1, \ldots, \mathbf{t}_N]^{T}$ is the label matrix, and $\mathbf{h}(\mathbf{x}) = g(\mathbf{W}\mathbf{x} + \mathbf{b})$.

The objective function can be considered as a ridge regression or regularized least squares problem. By setting the gradient of the objective function with respect to $\boldsymbol{\beta}$ to zero, we have

$$\boldsymbol{\beta} + C\,\mathbf{H}^{T}(\mathbf{H}\boldsymbol{\beta} - \mathbf{T}) = 0. \qquad (3)$$

There are two cases in the process of solving $\boldsymbol{\beta}$. If $N \geq L$, equation (3) is overdetermined [20]; the optimal solution is

$$\boldsymbol{\beta} = \left(\frac{\mathbf{I}_{L}}{C} + \mathbf{H}^{T}\mathbf{H}\right)^{-1}\mathbf{H}^{T}\mathbf{T}, \qquad (4)$$

where $\mathbf{I}_{L}$ is an $L$-dimensional identity matrix.

If $N < L$, equation (3) is underdetermined [23]; the optimal solution is

$$\boldsymbol{\beta} = \mathbf{H}^{T}\left(\frac{\mathbf{I}_{N}}{C} + \mathbf{H}\mathbf{H}^{T}\right)^{-1}\mathbf{T}, \qquad (5)$$

where $\mathbf{I}_{N}$ is an $N$-dimensional identity matrix.

In the classification task, given a sample $\mathbf{x}$ to be tested, the classification result is obtained as

$$y = \arg\max_{k \in \{1, \ldots, m\}} f_{k}(\mathbf{x}), \qquad (6)$$

where $f(\mathbf{x}) = \mathbf{h}(\mathbf{x})\boldsymbol{\beta}$.
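For concreteness, the following is a minimal NumPy sketch of standard ELM training and prediction following equations (4)–(6); the sigmoid activation, default parameter values, and function names are illustrative assumptions rather than details fixed by the paper.

```python
import numpy as np

def elm_train(X, T, L=200, C=1.0, seed=0):
    """Train a regularized ELM: random hidden layer, closed-form output weight.

    X: (N, d) inputs; T: (N, m) one-hot targets; L: hidden nodes; C: tradeoff.
    Uses equation (4) when N >= L and equation (5) otherwise.
    """
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], L))    # random input weights, no tuning
    b = rng.standard_normal(L)                  # random hidden biases
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))      # sigmoid hidden layer output

    N = X.shape[0]
    if N >= L:   # overdetermined case, equation (4)
        beta = np.linalg.solve(np.eye(L) / C + H.T @ H, H.T @ T)
    else:        # underdetermined case, equation (5)
        beta = H.T @ np.linalg.solve(np.eye(N) / C + H @ H.T, T)
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Classify by the largest network output, as in equation (6)."""
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return np.argmax(H @ beta, axis=1)
```

For example, `W, b, beta = elm_train(X_train, T_onehot)` followed by `elm_predict(X_test, W, b, beta)` returns predicted class indices.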

3. TELM-OWA

In the past few years, the theory and application of ELM have received extensive attention from scholars, and great progress has been made in this field. However, when there are few training samples, the performance of ELM decreases [34]. Transfer learning draws on knowledge from a relevant domain to improve the learning efficiency of tasks in the target domain [31]. Therefore, through transfer learning, the performance of ELM can be improved when labeled samples are insufficient.

In transfer learning, there are two different but related datasets: the source domain $\mathcal{D}_S = \{(\mathbf{x}_i^{s}, \mathbf{t}_i^{s})\}_{i=1}^{n_s}$ and the target domain $\mathcal{D}_T = \mathcal{D}_{Tl} \cup \mathcal{D}_{Tu}$. Here $\mathbf{x}_i^{s}$ and $\mathbf{t}_i^{s}$ are a source domain sample and its label, respectively, and $n_s$ is the number of source samples. Accordingly, $\mathbf{x}_j^{t}$ and $\mathbf{t}_j^{t}$ in $\mathcal{D}_{Tl}$ are a labeled target sample and its corresponding label, $\mathbf{x}_k^{u}$ in $\mathcal{D}_{Tu}$ is an unlabeled target sample, and $n_l$ and $n_u$ are the numbers of labeled and unlabeled samples in $\mathcal{D}_T$, with $n_l \ll n_u$. In this section, we aim to construct an ELM model using $\mathcal{D}_S$ and $\mathcal{D}_{Tl}$ to obtain high accuracy on $\mathcal{D}_{Tu}$.

3.1. Output Layer Weight Alignment

By using the source domain labeled samples and the target domain labeled samples, respectively, two ELMs can be built as follows:

$$\boldsymbol{\beta}_{s} = \arg\min_{\boldsymbol{\beta}}\ \frac{1}{2}\|\boldsymbol{\beta}\|_{F}^{2} + \frac{C}{2}\|\mathbf{T}_{s} - \mathbf{H}_{s}\boldsymbol{\beta}\|_{F}^{2}, \qquad \boldsymbol{\beta}_{t} = \arg\min_{\boldsymbol{\beta}}\ \frac{1}{2}\|\boldsymbol{\beta}\|_{F}^{2} + \frac{C}{2}\|\mathbf{T}_{t} - \mathbf{H}_{t}\boldsymbol{\beta}\|_{F}^{2}, \qquad (7)$$

where $\mathbf{H}_{s}$ is the hidden layer output matrix of $\mathcal{D}_S$ and $\boldsymbol{\beta}_{s}$ is the output layer weight of the ELM obtained by training on the source domain. Accordingly, $\mathbf{H}_{t}$ is the hidden layer output matrix of $\mathcal{D}_{Tl}$ and $\boldsymbol{\beta}_{t}$ is the output layer weight of the ELM obtained by training on the labeled target samples. Due to the difference in distribution between $\mathcal{D}_S$ and $\mathcal{D}_T$, it can be known that $\boldsymbol{\beta}_{s} \neq \boldsymbol{\beta}_{t}$. Inspired by the literature [46, 47], a transformation matrix $\mathbf{M}$ is used to align the output layer of the ELM between the source domain and the target domain in order to achieve cross-domain knowledge transfer. The function is established as follows:

$$F(\mathbf{M}) = \|\boldsymbol{\beta}_{s}\mathbf{M} - \boldsymbol{\beta}_{t}\|_{F}^{2}, \qquad (8)$$

where $\|\cdot\|_{F}$ denotes the Frobenius norm. It can be known from the previous equation [43] that the optimal transformation is $\mathbf{M}^{*} = \arg\min_{\mathbf{M}} F(\mathbf{M})$.

Since the Frobenius norm is invariant to orthogonal operations [46], equation (8) can be rewritten as

$$F(\mathbf{M}) = \|\boldsymbol{\beta}_{s}^{T}\boldsymbol{\beta}_{s}\mathbf{M} - \boldsymbol{\beta}_{s}^{T}\boldsymbol{\beta}_{t}\|_{F}^{2} = \|\mathbf{M} - \boldsymbol{\beta}_{s}^{T}\boldsymbol{\beta}_{t}\|_{F}^{2}. \qquad (9)$$

For equation (9), we can conclude that the optimal transformation is $\mathbf{M}^{*} = \boldsymbol{\beta}_{s}^{T}\boldsymbol{\beta}_{t}$. Therefore, $\boldsymbol{\beta}_{s}\mathbf{M}^{*} = \boldsymbol{\beta}_{s}\boldsymbol{\beta}_{s}^{T}\boldsymbol{\beta}_{t}$ can be regarded as the output layer weight after the output layer of the source domain ELM model is aligned to the target domain, as shown in Figure 3.
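As a minimal sketch of this closed-form alignment (assuming, as derived above, that the optimal transformation is taken to be $\mathbf{M}^{*} = \boldsymbol{\beta}_{s}^{T}\boldsymbol{\beta}_{t}$):

```python
import numpy as np

def align_source_weight(beta_s, beta_t):
    """Align the source ELM output weight to the target domain.

    beta_s, beta_t: (L, m) output weights of the source and target ELMs.
    Returns beta_s @ M with M = beta_s^T beta_t, i.e. the source output
    weight after Output Weight Alignment.
    """
    M = beta_s.T @ beta_t      # (m, m) transformation matrix
    return beta_s @ M          # aligned source output weight, shape (L, m)
```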

3.2. Objective Function of TELM-OWA

In order to realize the transfer ability of the Extreme Learning Machine, the objective function in equation (10) is established, in which the last term is a regularization term that facilitates knowledge transfer and prevents negative transfer, and the other coefficients are balance parameters.

To align the output layer of the source ELM to the target one, we replace $\boldsymbol{\beta}_{s}$ with $\boldsymbol{\beta}_{s}\mathbf{M}$ and substitute it into equation (10) to obtain equation (11).

Because $\mathbf{M} = \boldsymbol{\beta}_{s}^{T}\boldsymbol{\beta}_{t}$, equation (11) becomes equation (12).

For further simplification, we change equation (12) into equation (13).

With this substitution, the objective function of TELM-OWA can be simplified into the standard ELM form of equation (14), and the output weight with knowledge transferability is then obtained in closed form as equation (15).

After $\boldsymbol{\beta}$ is obtained, the test samples are classified by equation (6). The complete classification procedure of TELM-OWA is summarized in Algorithm 1.

Input: Source dataset $\mathcal{D}_S$, target dataset $\mathcal{D}_T$, and the trade-off parameters.
Output: Output layer weight $\boldsymbol{\beta}$.
Step 1: Use to calculate according to equation (6).
Step 2: Solve , , and by using , , and .
Step 3: Solve the output weight $\boldsymbol{\beta}$ according to equation (15).
Step 4: Use $\boldsymbol{\beta}$ to predict the test samples and get their labels.
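To make Algorithm 1 concrete, the following is a minimal NumPy sketch of the whole pipeline. It rests on two assumptions not fixed by the text above: the hidden layer uses a sigmoid activation with randomly drawn weights, and the transfer regularizer in equations (10)–(15) is modeled as a simple penalty $\lambda\|\boldsymbol{\beta} - \boldsymbol{\beta}_{s}\mathbf{M}\|_{F}^{2}$ pulling the learned weight toward the aligned source weight; the exact objective and closed form of the paper may therefore differ.

```python
import numpy as np

def hidden_output(X, W, b):
    """Random-feature hidden layer H = g(XW + b) with a sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-(X @ W + b)))

def elm_output_weight(H, T, C):
    """Regularized ELM solution in the style of equation (4)."""
    L = H.shape[1]
    return np.linalg.solve(np.eye(L) / C + H.T @ H, H.T @ T)

def telm_owa_sketch(Xs, Ts, Xtl, Ttl, L=500, C=1.0, Ct=1.0, lam=0.1, seed=0):
    """Illustrative TELM-OWA-style training (not the paper's exact equations).

    Xs, Ts: source samples and one-hot labels; Xtl, Ttl: the few labeled
    target samples and their one-hot labels. The steps mirror Algorithm 1:
    train the source and target ELMs, align the source output weight via
    M = beta_s^T beta_t, then solve a ridge problem that fits the target
    labels while staying close to the aligned source weight.
    """
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((Xs.shape[1], L))
    b = rng.standard_normal(L)

    Hs, Htl = hidden_output(Xs, W, b), hidden_output(Xtl, W, b)
    beta_s = elm_output_weight(Hs, Ts, C)          # source ELM
    beta_t = elm_output_weight(Htl, Ttl, C)        # target ELM (few labels)
    beta_aligned = beta_s @ (beta_s.T @ beta_t)    # Output Weight Alignment

    # min_beta 1/2||beta||^2 + Ct/2||Ttl - Htl beta||^2 + lam/2||beta - beta_aligned||^2
    A = (1.0 + lam) * np.eye(L) + Ct * (Htl.T @ Htl)
    B = Ct * (Htl.T @ Ttl) + lam * beta_aligned
    return W, b, np.linalg.solve(A, B)

def predict(X, W, b, beta):
    """Decision rule in the style of equation (6): argmax of h(x) beta."""
    return np.argmax(hidden_output(X, W, b) @ beta, axis=1)
```

A typical call would be `W, b, beta = telm_owa_sketch(Xs, Ts, Xtl, Ttl)` followed by `predict(Xtu, W, b, beta)` on the unlabeled target samples.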
3.3. Discussion

In order to improve the classification performance of ELM in the transfer learning environment, we propose TELM-OWA, whose objective function is given by equations (11) to (14). The following observations can be made:
(1) Compared with the traditional ELM, TELM-OWA adopts the aligned source output weight to utilize the source domain knowledge to help the target ELM obtain the optimal parameter, and it also increases the fitness of the model to the target domain data through the target empirical risk term.
(2) DAELM-S, proposed by Zhang and Zhang [34], also uses source knowledge to help the target task by keeping an empirical risk term on the source data in its objective function. Though this term transfers knowledge from the source domain and increases the fitness of the model to the source data, it decreases the fitness to the target domain compared with TELM-OWA, in which the aligned source output weight is a closer approximation to the target output weight thanks to the subspace alignment mechanism. Therefore, the alignment-based term of TELM-OWA increases the fitness of the model to the target data more and better promotes the transfer of knowledge across domains. As a result, TELM-OWA has stronger knowledge transfer capability than DAELM-S.
(3) Although DAELM-T, also proposed by Zhang and Zhang [34], uses a regularization term to promote the approximation between the source and target output weights, the aligned weight used by TELM-OWA is obviously a closer approximation according to equation (9) and Figure 3. Therefore, TELM-OWA has a better knowledge transfer effect than DAELM-T.
(4) Because TELM-OWA and DAELM-T first need to solve an auxiliary output weight before solving the optimal parameter $\boldsymbol{\beta}$, they have higher computational complexity than ELM and DAELM-S, and this extra cost grows with the number of hidden layer nodes $L$.
(5) In [37], PTELM also adopted Output Weight Alignment based on ELM for knowledge transfer, but there are two differences between PTELM and TELM-OWA. On the one hand, PTELM is designed for unsupervised transfer learning, in which no target labels are needed, whereas TELM-OWA is a supervised transfer learning algorithm requiring a few target labels. On the other hand, PTELM needs to solve the projection matrix for Output Weight Alignment and the output weight with the coordinate descent method in an alternating optimization manner, whereas in TELM-OWA only the output weight needs to be solved, as in the standard ELM.

4. Experiment and Analysis

To verify the validity of TELM-OWA, four groups of datasets, namely, Office + Caltech object recognition, USPS and MNIST handwritten digits, MSRC and VOC2007 object recognition, and the Reuters-21578 text dataset, are used for classification experiments; the image and text datasets are described in Table 1. All experiments are carried out on a PC with 8 GB memory and the Windows 10 operating system, and the algorithms are implemented in MATLAB 2017b. Each experiment is repeated 20 times, and the average result is reported. The performance of each algorithm is evaluated by the accuracy rate:

$$\text{Accuracy} = \frac{\left|\left\{\mathbf{x} \in \mathcal{D}_{\text{test}} : \hat{y}(\mathbf{x}) = y(\mathbf{x})\right\}\right|}{\left|\mathcal{D}_{\text{test}}\right|},$$

where $\hat{y}(\mathbf{x})$ and $y(\mathbf{x})$ denote the predicted and true labels of a test sample $\mathbf{x}$, respectively.

4.1. Dataset Description
(i) USPS + MNIST: both USPS and MNIST are image datasets of handwritten digits. They are different but related, with 10 digit categories in total. During the experiments, two sets of experimental data (USPS vs. MNIST and MNIST vs. USPS) were constructed as follows: 1800 images were randomly selected from USPS as the source (respectively, target) domain dataset, and correspondingly, 2000 images were randomly selected from MNIST as the target (respectively, source) domain dataset. All images in USPS and MNIST are uniformly rescaled to 16 × 16 pixels, and each image is represented as a grayscale image whose pixels are encoded by gray values.
(ii) MSRC + VOC: the MSRC dataset is provided by Microsoft Research Cambridge and contains 18 categories with a total of 4323 images. The VOC2007 dataset contains 20 categories with a total of 5011 images. MSRC and VOC2007 are related but follow different distributions: MSRC consists of standard benchmark images, whereas VOC2007 is built freely from web album images. They share the following 6 semantic categories: airplane, bird, cow, family car, sheep, and bicycle. The transfer learning dataset MSRC versus VOC is constructed by selecting 1269 subimages from the MSRC dataset as the source domain and 1530 subimages from the VOC2007 dataset as the target domain. We then exchange the source and target domains to build a new transfer learning dataset, VOC versus MSRC. We convert all images into 0–256 gray levels and extract 240 dimensions as the feature dimension of each sample.
(iii) Office + Caltech: Office is a common dataset for visual cross-domain learning, with 3 real-world object sub-datasets: Amazon (images downloaded from an online merchant website), Webcam (photographed by a low-resolution webcam), and DSLR (photographed by a digital SLR high-resolution camera). This dataset contains 4652 images in 31 categories. Caltech is also a standard dataset commonly used for object recognition and contains 30,607 images in 256 categories. The Office + Caltech dataset released by Gong [56] contains four domains, C (Caltech-256), A (Amazon), W (Webcam), and D (DSLR), over the 10 common classes. During the experiments, two different domains are selected as the source and target domain datasets, so that 12 cross-domain datasets can be constructed, namely, C⟶A, C⟶W, C⟶D, ..., and D⟶W.
(iv) Reuters-21578: the Reuters-21578 text dataset is a common dataset for text categorization, containing 21,577 news articles from Reuters in 1987 that were manually labeled by Reuters with 5 classes, including "exchanges," "orgs," "people," "places," and "topics." The 5 classes are divided into multiple major classes and subclasses. The three largest classes, shown in Table 1, are "orgs," "people," and "place," which can be used to construct 6 cross-domain text classification tasks: orgs versus people, people versus orgs, orgs versus place, place versus orgs, people versus place, and place versus people. We conduct a more intensive evaluation on these 6 classification tasks.
4.2. Experimental Results and Analysis

We compare the proposed algorithm with several classifiers to evaluate its performance.

4.2.1. Classifiers without Transfer Learning
(i) 1NN: k-nearest-neighbor classifier with one nearest neighbor.
(ii) SVM: support vector machine with a linear kernel.
(iii) ELM: standard Extreme Learning Machine.
(iv) SSELM [23]: ELM with a graph regularization term for semisupervised learning.
4.2.2. Classifiers for Transfer Learning
(i) TCA [52] + 1NN: classifier built by combining TCA with 1NN for the transfer learning classification task.
(ii) TCA [52] + SVM: classifier built by combining TCA with SVM for the transfer learning classification task.
(iii) JDA [53] + 1NN: classifier built by combining JDA with 1NN for the transfer learning classification task.
(iv) JDA [53] + SVM: classifier built by combining JDA with SVM for the transfer learning classification task.
(v) DAELM-S [34]: ELM trained using a large number of labeled source data and a limited number of labeled target data for domain adaptation.
(vi) DAELM-T [34]: ELM trained using a limited number of labeled target data and numerous unlabeled target data to approximate the prediction of an ELM trained on the source data.
(vii) ARRLS [57]: a general transfer learning framework referred to as adaptation regularization based transfer learning, using the squared loss.
(viii) TELM-OWA: our proposed supervised classifier, Transfer Extreme Learning Machine with Output Weight Alignment.

In the experiment, the SVM penalty parameter and the penalty parameter $C$ in ELM, SSELM, DAELM_S, DAELM_T, and TELM-OWA are chosen from predefined candidate sets. TCA and JDA are feature transfer algorithms that are combined with PCA to extract a shared feature subspace based on MMD; the dimension of the feature subspace and the value range of the balance-constraint parameter of the projection matrix in TCA and JDA are likewise tuned over candidate ranges. The ARRLS algorithm combines JDA with structural risk minimization and graph regularization terms to improve the knowledge transfer effect; its parameters are set according to [57].

In each dataset, 20% of the target domain samples are randomly selected as the small set of labeled target samples and, together with the source domain samples, are used for training, while the remaining target samples serve as the test set. In 1NN, SVM, ELM, SSELM, TCA + (1NN, SVM), JDA + (1NN, SVM), and ARRLS, the labeled samples from the source and target domains are used together to train the classifier. Table 2 shows the classification results of the algorithms on the image and text datasets.

The classification results in Table 2 and Figures 4–7 show the following. (1) The average accuracy of TELM-OWA across the 22 tasks is 72.13%, and TELM-OWA clearly outperforms the other methods on most tasks. (2) TELM-OWA outperforms DAELM_S and DAELM_T, indicating the superiority of Output Weight Alignment and the approximation constraint between output weights, which promote the transfer of knowledge across domains. (3) TELM-OWA, DAELM_S, and DAELM_T achieve good results compared with most other algorithms, showing that ELM with the ability of knowledge transfer performs well for transfer learning. (4) The standard machine learning methods, that is, 1NN, SVM, and ELM, suffer from the domain shift problem and thus obtain unsatisfactory performance, although ELM performs better than 1NN and SVM because of its good fit and generalization to the data. (5) The semisupervised method SSELM performs better than ELM by exploring the geometric properties of the domain, but worse than TELM-OWA, DAELM_S, and DAELM_T because it does not consider the domain shift problem. (6) Due to the lower accuracy of 1NN, TCA + 1NN and JDA + 1NN are worse than SVM, ELM, TCA + SVM, and JDA + SVM, but better than 1NN. (7) The accuracy of the feature extraction algorithms with transfer capability, such as TCA + SVM and JDA + SVM, is higher than that of SVM, and similarly when 1NN is used as the classifier, indicating the importance of feature transfer learning when labeled samples are few or differently distributed. (8) The accuracy of JDA + 1NN and JDA + SVM is generally higher than that of TCA + 1NN and TCA + SVM, which indicates the superiority of reducing the marginal and conditional distribution discrepancies at the same time. (9) ARRLS generally outperforms all baseline methods by minimizing the difference between both the marginal and conditional distributions while preserving manifold consistency.

The computational time of 1NN, SVM, ELM, SSELM, TCA + 1NN, TCA + SVM, JDA + 1NN, JDA + SVM, DAELM_S, DAELM_T, ARRLS, and TELM-OWA on the MNIST versus USPS dataset is also investigated, as shown in Table 3. The following can be seen: (1) the time cost of the ELM-based methods is lower than that of the other algorithms except 1NN, indicating that the speed of ELM is superior. (2) TELM-OWA consumes more time than ELM, SSELM, DAELM_S, and DAELM_T because it first needs to solve the source and target output weights and then obtain the final output weight. (3) SSELM consumes more time than ELM, DAELM_S, and DAELM_T because it needs to construct the graph Laplacian matrix before obtaining the output weight. (4) The classifiers with feature extraction consume more time than the corresponding standard classifiers. (5) The time cost of the SVM-based methods is higher than that of the other algorithms. (6) JDA + 1NN and JDA + SVM apply an iterative procedure to refine the pseudo-labels of the target domain, so their time cost is higher than that of TCA + 1NN and TCA + SVM.

Moreover, from Tables 2–3 and Figures 4–7 we can see the following: (1) TELM-OWA, as an extension of ELM to transfer learning, also has a faster learning speed and higher accuracy than the non-ELM methods, because it maintains the advantages of the good fitting ability of neural networks and of the ridge regression model with a closed-form solution. (2) Although TELM-OWA has higher accuracy than ELM, SSELM, DAELM_S, and DAELM_T, it also needs more learning time; if the number of hidden-layer nodes is reduced, its learning speed improves while its accuracy drops only slightly (see Figure 8(b)). (3) TCA + 1NN, TCA + SVM, JDA + 1NN, and JDA + SVM, as two-stage feature transfer classifiers (i.e., feature extraction first and classification afterwards), are a little weaker because their feature extraction and classification processes are separated and cannot be combined into a unified optimization framework.

4.3. Parameter Analysis

To evaluate how the performance of TELM-OWA varies with the ratio of labeled target domain samples, the number of hidden layer nodes, and the three balance parameters, we conduct experiments on 4 datasets (orgs versus people, MSRC versus VOC, MNIST versus USPS, and A versus D); the results are shown in Figures 8(a)–8(e). The following can be seen: (1) the accuracy of TELM-OWA increases with the number of labeled target samples used for training, as shown in Figure 8(a). When the number of labeled target samples is small, the source domain knowledge can help the target domain task; as the number of labeled target samples increases, the trained model fits the target data better and achieves higher accuracy. (2) As shown in Figure 8(b), the accuracy of TELM-OWA increases with the number of hidden layer nodes on the 4 datasets. This verifies that a large number of hidden nodes is beneficial because it allows the ELM network to better approximate the output function. (3) In Figure 8(c), with the gradual increase of the first balance parameter, the accuracy first increases and then decreases slightly. When this parameter is too small, the helpful information from the source domain is underutilized, leading to low performance; when it is too large, the trained model overfits the source domain samples, resulting in performance degradation. TELM-OWA achieves good results at moderate values, and the orgs versus people dataset is robust to changes in this parameter. (4) In Figure 8(d), the accuracy exhibits a slightly rising and then declining tendency with the increase of the second balance parameter, with better accuracy obtained at moderate values. When this parameter is small, the performance is slightly low because the learned output weight stays far from the aligned source weight; when it is too large, it weakens the influence of the empirical risk of the labeled samples from the source and target domains, and the accuracy degrades. (5) As shown in Figure 8(e), the accuracy first increases and then decreases with the increase of the third parameter, which controls the quality of the aligned output weight, and better classification results are achieved at moderate values.

5. Conclusion

To address the performance degradation of the traditional Extreme Learning Machine when only a small number of reliable training samples are available, we propose TELM-OWA, an Extreme Learning Machine with the ability of knowledge transfer. It reduces the distribution difference across domains by aligning the ELM output weight matrices between domains and by introducing the approximation between the interdomain ELM output weight matrices into the objective function. Moreover, the objective function is transformed into the standard ELM form to be solved efficiently. Extensive experiments comparing the proposed algorithm with related algorithms show that TELM-OWA achieves higher accuracy and better generalization performance.

TELM-OWA still has some limitations: (1) it needs some labeled samples in the target domain and is therefore not suitable for the unsupervised transfer learning environment. (2) It reduces the distribution difference across domains by aligning the ELM output weight matrices between domains but ignores the overall distribution difference at the output layer, so the divergence of the statistical distributions between the source and target domains may remain because of the variance among individual dimensions. (3) Its shallow architecture fails to find higher-level representations and thus cannot capture relevant higher-level abstractions.

Consequently, future research will focus on the following three aspects to improve TELM-OWA: firstly, reliable sample selection will be introduced to extend the method to unsupervised transfer learning; secondly, the effectiveness of knowledge transfer will be further promoted by jointly aligning the ELM output weight matrices and minimizing the divergence of statistical distributions; thirdly, similar to deep learning, TELM-OWA will be improved by stacking it into a deep structure for extracting deep features.

Data Availability

To verify the validity of TELM-OWA, four different datasets, Office + Caltech object recognition, USPS and MNIST digital handwriting, MSRC and VOC2007 object recognition, and Reuters-21578 text dataset, are used for classification experiments. (1) https://github.com/jindongwang/transferlearning/blob/master/data/dataset.md; (2) https://www.cse.ust.hk/TL/index.html; and (3) http://ise.thss.tsinghua.edu.cn/∼mlong/publications.html. MSRC and VOC2007 object recognition datasets are released in the paper named “Transfer Joint Matching for Unsupervised Domain Adaptation”.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported in part by the National Key Research and Development Program of China under Grant 2016YFE0104600 and the National Natural Science Foundation of China under Grants U1804150 and 62073124.