Special Issue: Analysis and Applications of Location-Aware Big Complex Network Data
Qi Zhu, Ning Yuan, Donghai Guan, "Cognitive Driven Multilayer Self-Paced Learning with Misclassified Samples", Complexity, vol. 2019, Article ID 8127869, 10 pages, 2019. https://doi.org/10.1155/2019/8127869
Cognitive Driven Multilayer Self-Paced Learning with Misclassified Samples
Abstract
In recent years, self-paced learning (SPL) has attracted much attention due to the improvement it brings to machine learning algorithms based on non-convex optimization. As a methodology borrowed from human learning, SPL dynamically evaluates the learning difficulty of each sample and provides a weighted learning model that guards against the negative effects of hard-learning samples. In this study, we propose a cognitive-driven SPL method, retrospective robust self-paced learning (R2SPL), inspired by two observations about the human learning process: misclassified samples leave a stronger impression on subsequent learning, and a model from a later learning stage, built on a large number of samples, can be used to reduce the risk of poor generalization in the initial learning phase. We simultaneously estimate the degree of learning difficulty and the misclassification status of each sample in every step of SPL, and we propose a framework for constructing multilevel SPL that improves the robustness of the initial learning phase of SPL. The proposed method can be viewed as a multilayer model in which the output of the previous layer guides the construction of a robust initialization model for the next layer. The experimental results show that R2SPL outperforms conventional self-paced learning models on classification tasks.
1. Introduction
By arranging samples in a meaningful learning order based on prior knowledge, curriculum learning (CL) [1] provides an easy-to-hard learning process, which makes the model fit human cognition better. To make curriculum learning more practical for machine learning problems, Kumar et al. [2] adaptively assessed sample learning difficulty within the model and proposed the self-paced learning method. Specifically, a self-paced algorithm actively and dynamically derives the initial learning sequence from the original data and gradually adds harder samples at each iteration, whereas in curriculum learning the sample learning sequence is preset. By predefining or dynamically generating the learning sequence, curriculum learning and self-paced learning can help the objective function avoid falling into a bad local optimum. Many researchers have applied curriculum learning and self-paced learning to tough pattern recognition problems. Jiang et al. [3] proposed self-paced curriculum learning, which not only obtains a dynamic sample sequence during model learning but also uses prior knowledge to avoid overfitting. Zhao et al. [4] applied self-paced learning to the non-convex matrix factorization problem, suppressing the effect of noise and outliers in the data on the model; they also pointed out that the strategy of adaptively selecting easy-learning sample sequences resembles the process of human cognition. Supancic and Ramanan [5] adopted self-paced learning for long-term tracking and achieved promising results. Self-paced learning has been introduced into many learning models and has shown good performance in many real-world applications.
Self-paced learning seems to challenge conventional learning methods such as active learning, boosting, and transfer learning. From the machine learning point of view, boundary samples, noise samples, and outliers increase the uncertainty of the model and may make it generate a bad classification boundary. Therefore, compared to easy-learning samples, hard-learning samples have drawn much attention in conventional models. In our work, we deal with supervised learning problems, in which easy-learning samples correspond to samples with small loss and hard-learning samples to samples with large loss. In unsupervised learning, easy-learning samples are those that are easy to assign, while hard-learning samples are those that make the model unstable. In this paper, misclassified samples are those for which the product of the predicted value and the label is negative. Typically, in AdaBoost learning, the model trains the classifier by changing the sample distribution based on the misclassified samples of previous iterations [6]. Li et al. [7] trained a sequence of AdaBoost classifiers, starting with a weak learner and progressively boosting it into a strong learner. Active learning is a kind of semi-supervised learning that chooses to label the samples most valuable to the model. Low-confidence samples that may contain useful information are difficult to choose and require additional expert knowledge to identify. Tur et al. [8] presented a spoken language understanding method combining active and semi-supervised learning with human-labeled and automatically labeled data. Huang et al. [9] proposed a systematic framework to simultaneously measure the informativeness and representativeness of an instance.
The informativeness criterion reflects the ability of samples to reduce the uncertainty of the model based on the labeled data, while the representativeness criterion measures which samples can best represent the unlabeled data. In contrast, a self-paced learning model first considers the easy-learning samples with small prediction loss and gradually adds hard-learning samples with larger prediction loss to extend the training set. The difference between self-paced learning and transfer learning is that transfer learning improves the generalization of the model by sharing models across different tasks [10], whereas self-paced learning updates and learns by itself to obtain a local optimal solution.
Study [11] pointed out the inherent consistency between human cognition and reinforcement learning. In dealing with a learning problem, humans and other animals use a harmonious combination of repeated learning and hierarchical sensory processing systems. In self-paced learning, the initial model is trained insufficiently with only a few easy-learning samples, which increases the learning risk of follow-up iterations and may even reduce the generalization of the final model. The usual remedies for the small sample problem include feature selection [12], regularization, adding artificial samples [13], etc. In order to improve the generalization of the initial model, which is built from a small number of samples in self-paced learning, we design a recurrent framework that uses the model from the previous self-paced learning pass to repeatedly construct the initial model. In analogy with the repeated learning process of humans, if the initial model inherits the properties of a model learned from a large number of samples, the final model may be more robust and discriminative.
Meanwhile, although self-paced learning and some conventional machine learning methods (AdaBoost, active learning, and transfer learning) differ greatly in sample processing, we can still absorb the advantages of these conventional methods into self-paced learning. Specifically, in this paper, we propose retrospective robust self-paced learning (R2SPL). In each iteration of self-paced learning, besides considering easy-learning samples, the samples misclassified in the last iteration are also involved in training the model. For example, if hard-learning samples (whose categories are difficult to determine) form the majority of the data, conventional self-paced learning may not reach a good local optimal solution. In this case, our proposed method focuses on both easy-learning samples and misclassified samples in each iteration, which drives the final mature model to be robust and discriminative.
Overall, our main contributions can be summarized as follows:
(i) We introduce misclassified samples, together with easy samples with small loss, in each iteration to guide the model to become more discriminative.
(ii) Retrospective self-paced learning is proposed to improve the robustness of the initialization of self-paced learning.
(iii) Experimental results show that the proposed method achieves promising results in classification tasks.
The remainder of this paper is organized as follows. We briefly introduce related work on self-paced learning in Section 2. We propose the robust SPL in Section 3. In Section 4, we conduct experiments on the UCI and ADNI datasets. We provide the conclusion and our future research plan in Section 5.
2. Related Work
2.1. Curriculum Learning and Self-Paced Learning
In 2009, Bengio et al. [1] proposed curriculum learning, a method that imitates the order in which children are taught. Unlike conventional machine learning methods that learn from all samples at once, their work sorts the samples in a meaningful order and learns the model in several stages. Benefiting from prior knowledge, curriculum learning can achieve better results than other machine learning models on some tasks. However, arranging the sample order usually requires expert identification, which increases the difficulty and cost of building the model. In addition, the ordered sample sequence is static and lacks flexibility for new samples or tasks. To alleviate this deficiency, Kumar et al. [2] proposed self-paced learning in 2010. Without any prior knowledge or expert identification, self-paced learning can dynamically order the samples from easy to difficult based on the fit between the samples and the model. In multimedia retrieval, Jiang et al. [14] proposed a self-paced reranking model for multimodal data, and the model made significant progress on both image and video search tasks. Zhou et al. [15] brought self-paced learning to deep neural networks, adaptively involving faithful samples in the training process. By analyzing the working mechanism of self-paced learning, Fan et al. [16] proposed a general implicit regularization framework. As self-paced learning has been adopted in many models, the commonality among them lies in the sample processing: in each iteration, these models usually pick the high-confidence samples that fit the model better to construct the current model and gradually use the remaining low-confidence samples to fine-tune the model so that it generalizes better.
Curriculum learning was the first attempt to combine a human cognition sequence with a machine learning model. Although curriculum learning has some drawbacks, it brought the idea of easy-to-hard learning to later models. Self-paced learning is an extension of curriculum learning that is more flexible and concise. Similar to human learning, self-paced learning trains on samples from easy to difficult and gradually improves the robustness of the model.
2.2. Tough Samples Learning
In its sample processing strategy, self-paced learning differs from tough-sample-focused learning methods such as AdaBoost, active learning, and transfer learning. In our work, we try to finely distinguish different types of samples, including easy-learning samples, hard-learning samples, and misclassified samples, and give them different weights in the model. By combining simple classifiers, AdaBoost can deal with complicated problems. For example, in many multiclass problems [17, 18], the distribution of samples is highly complex [19]. Like SVM, AdaBoost can asymptotically achieve a margin distribution that is robust to noise [7, 20, 21]. Active learning is a semi-supervised model that uses unlabeled samples to improve a model obtained from labeled samples. However, since the unlabeled samples have no tags, samples whose classes are difficult to distinguish usually need to be annotated manually; if these samples are labeled by the model itself, the uncertainty of the model may increase. Lin et al. [22] proposed active self-paced learning, which uses the characteristics of both models to automatically annotate high-confidence and low-confidence samples and incorporates them into training under weak expert re-certification. Kumar et al. [2] pointed out that certainty does not imply correctness. Many researchers have applied SVM and active learning to practical applications such as text classification [23, 24], image retrieval [25], and image segmentation [26]. In transfer learning, the model adjusts the weights of data from the original domain, which increases the similarity between the target-domain and source-domain data [10]. In children's learning, some problems share a common underlying structure but differ in surface manifestations, which is similar to the characteristics of transfer learning [27].
In order to make the models close to human wisdom, many researchers combine the models with environmental feedback and transfer learning [28–30].
In our work, we focus on both easy-learning samples and tough samples, which improves the discrimination of self-paced learning. In each iteration, we simultaneously select easy-learning samples and misclassified samples for training. As in human cognition, learning the high-confidence samples and the low-confidence samples together in each iteration helps improve the generalization of the final self-paced model.
3. Proposed Method
3.1. Robust SPL
Specifically, we define a diagonal weight matrix to encode the misclassification weight of each sample. Let $y_i$ and $\hat{y}_i$ denote the label and predicted value of the $i$th sample, respectively. For a binary classification problem, if $y_i \hat{y}_i > 0$, the $i$th sample is correctly classified; otherwise, it is considered a misclassified sample. In our work, the weights of the misclassified samples ($y_i \hat{y}_i < 0$) in the weight matrix should be larger than those of the correctly classified samples, and the range of these weights should not vary greatly. Therefore, we adopt the sigmoid function, shown in Figure 1, as the weight function with respect to the product of the label and the predicted value: given the label vector, the data matrix, and the current model parameters, the misclassification weight of the $i$th sample is a sigmoid that decreases with $y_i \hat{y}_i$.

For a supervised problem, the self-paced learning function assigns weights to samples based on the sample loss, and samples with small loss are viewed as easy-learning samples. Our model, however, simultaneously considers easy-learning samples and tough samples in each iteration. Specifically, we linearly combine the misclassification weight matrix and the self-paced weight matrix, so the objective consists of the weighted loss, a regularization term, and the self-paced regularizer. To embed structural information in feature extraction, we adopt the $\ell_p$ norm on the regression coefficient; the closer the value of $p$ gets to 0, the sparser the result of the feature extraction. The self-paced weight function has a pace parameter that controls the number of samples used to construct the model. At first, only a few samples with small loss are used to construct the model; as the pace parameter is updated, more and more samples with larger loss join the model training process.
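The sigmoid misclassification weighting described above can be sketched as follows; this is a minimal illustration, not the paper's exact formula, and the slope parameter `k` and the linear predictor `X @ w` are assumptions for the sake of a concrete example.

```python
import numpy as np

def misclassification_weights(y, X, w, k=5.0):
    """Sigmoid-shaped weight on the product y_i * y_hat_i.

    Misclassified samples (negative product) receive weights above 0.5
    and correctly classified ones below 0.5; the sigmoid keeps the range
    bounded so the weights do not vary too greatly, as the text requires.
    The slope k is a hypothetical choice, not from the paper.
    """
    margin = y * (X @ w)                      # y_i * predicted value
    return 1.0 / (1.0 + np.exp(k * margin))   # larger weight when margin < 0
```

A misclassified sample thus automatically receives more attention in the next weighted fit, while the bounded sigmoid prevents any single sample from dominating.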
Whatever form the self-paced function takes, it should satisfy three properties [3, 14]: it is convex with respect to the sample weights; the sample weight should be monotonically decreasing with respect to its corresponding loss; and the sample weight should be monotonically decreasing with respect to the pace parameter.
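An objective of the kind described above, combining the weighted loss, a self-paced regularizer, and the $\ell_p$ penalty, can be written in the standard SPL form as a sketch (the symbols $v_i$, $m_i$, $\eta$, and $\beta$ are our notation for the self-paced weights, misclassification weights, mixing coefficient, and regularization strength, not necessarily the paper's):

```latex
\min_{\mathbf{w},\; \mathbf{v}\in[0,1]^{n}}
  \sum_{i=1}^{n} \left(v_i + \eta\, m_i\right)
  \bigl(y_i - \mathbf{w}^{\top}\mathbf{x}_i\bigr)^{2}
  \;+\; f(\mathbf{v};\lambda)
  \;+\; \beta \,\lVert \mathbf{w} \rVert_p^p
```

Here $f(\mathbf{v};\lambda)$ is the self-paced regularizer satisfying the three properties listed above, and the combined weight $v_i + \eta\, m_i$ realizes the linear combination of the self-paced and misclassification weight matrices.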
Meanwhile, in the process of human cognition, people usually make mistakes due to a lack of knowledge. By constantly reviewing unfamiliar and misunderstood concepts, people form a more robust knowledge system. Notably, if children get help from adults (such as teachers and parents) during cognition, they can construct their knowledge framework more rapidly and soundly. Self-paced learning resembles the education of children without the help of adults, so the learned initial model may not be robust enough. To alleviate this deficiency, we propose retrospective robust self-paced learning: we cascade multiple self-paced learning passes, which helps reduce the negative impact of the small sample size problem on the initialization of each follow-up self-paced learning pass. Naturally, in the next self-paced learning pass, the learning rate can be moderately increased. Repeating this process several times yields a more robust and discriminative model. The framework of the proposed method can be viewed as a multilayer network, shown in Figure 2. First, we construct the initial model from the easy-learning samples; this initial model lacks discriminability because it is trained on too few samples. Then, we adopt more hard-learning and misclassified samples to retrain the model, which drives it to be more robust and discriminative. When the training of one layer is finished, its converged result is used as prior knowledge for the next layer. This operation is repeated until all n layers have been trained, since each self-paced learning stage is essentially one layer of the network. Specifically, the output of the first layer can guide the choice of easy-learning samples in the initialization of the second layer, which can be expected to be more robust than a model learned independently.
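The cascaded training loop above can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's algorithm: it uses a squared loss with the binary self-paced scheme, a plain weighted ridge fit as the base learner, and omits the misclassification weights for brevity; `lam0`, `step0`, and `n_iter` are hypothetical settings (the text only fixes the first layer's step size at 0.1 and the number of layers at 3).

```python
import numpy as np

def weighted_ridge(X, y, s, reg=1e-2):
    # Weighted least squares: minimize sum_i s_i (y_i - w.x_i)^2 + reg*||w||^2
    d = X.shape[1]
    Xs = X.T * s                                   # X^T diag(s)
    return np.linalg.solve(Xs @ X + reg * np.eye(d), Xs @ y)

def retrospective_spl(X, y, n_layers=3, lam0=0.5, step0=0.1, n_iter=20):
    """Cascade of SPL passes: each layer warm-starts from the previous
    layer's model and uses a larger pace step, as described in the text."""
    w = np.zeros(X.shape[1])
    step = step0
    for _ in range(n_layers):
        lam = lam0
        for _ in range(n_iter):
            loss = (y - X @ w) ** 2                # per-sample squared loss
            v = (loss < lam).astype(float)         # binary self-paced weights
            w = weighted_ridge(X, y, v)            # refit on admitted samples
            lam += step                            # admit harder samples
        step *= 2.0                                # later layers move faster
    return w
```

The key point is the warm start: the second layer's first iterations already select "easy" samples according to the first layer's converged model rather than a cold-start model.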
3.2. Optimization
Since the parameters are independent of each other, we can fix the other parameters when computing each of them. In the $t$th iteration, each parameter can be calculated from the parameters of the $(t-1)$th iteration. In each iteration, every subproblem is convex, so the optimal solution of each parameter can be obtained. The solutions for the regression coefficient, the self-paced weights, and the misclassification weights are presented as follows.
(1) The Solution of the Regression Coefficient. To simplify the calculation, we convert the second term of (4) into a quadratic form involving a diagonal matrix whose diagonal elements are computed from the current regression coefficient. Then, (4) can be equivalently reformulated as (6). Taking the derivative of (6) with respect to the regression coefficient and setting it to 0 yields the optimal closed-form solution. In the next iteration, the model corrects its mistakes by guiding the regression coefficient with the misclassification weights. Under the combined influence of the self-paced weights and the misclassification weights, which correspond to easy samples with small loss and misdirected samples, respectively, our proposed self-paced learning model is more robust and discriminative than conventional self-paced learning models that only consider easy samples in each iteration.
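The update pattern described here can be sketched numerically as a weighted least-squares solve, with the $\ell_p$ penalty handled by a standard iteratively reweighted (IRLS-style) diagonal surrogate. The symbols `v`, `m`, and `eta` (self-paced weights, misclassification weights, and their mixing coefficient) are our notation; the exact closed form in the paper may differ.

```python
import numpy as np

def solve_w(X, y, v, m, eta=1.0, lam=0.1, p=2.0, eps=1e-6, n_irls=10):
    """Sketch of the regression-coefficient update: weighted least squares
    where the per-sample weight combines the self-paced weights v and the
    misclassification weights m, and a diagonal matrix D (iteratively
    reweighted) stands in for the l_p penalty on w."""
    s = v + eta * m                          # combined diagonal sample weight
    Xs = X.T * s                             # X^T diag(s)
    d = X.shape[1]
    w = np.zeros(d)
    for _ in range(n_irls):
        # IRLS surrogate: ||w||_p^p ~ w^T D w with D_ii = |w_i|^(p-2)
        D = np.diag((np.abs(w) + eps) ** (p - 2.0))
        w = np.linalg.solve(Xs @ X + lam * D, Xs @ y)
    return w
```

For p = 2 the diagonal D reduces to the identity and the update is an ordinary weighted ridge solve; for p < 2 the reweighting progressively shrinks small coefficients, producing the sparser solutions the text associates with small p.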
(2) The Solution of the Self-Paced Weights. In our work, the self-paced function is defined with two parameters, one describing the lower bound of the admissible sample loss and the other the upper bound; the pace parameter also describes the age of the model. In the initial stage, only easy samples with small loss are used to construct the model. As the pace parameter grows, more and more complicated samples with larger loss are adopted, making the model more mature. The sample weight can be calculated from our self-paced weight function: taking the derivative of (9) with respect to the sample weight, where the loss is the squared loss of the $i$th sample, and setting it to 0 gives the optimal closed-form weight. As mentioned above, we adopt the retrospective self-paced learning framework to increase the robustness and discrimination of the model. Specifically, the step size of the pace parameter in the current self-paced learning pass is smaller than that in the follow-up passes: we set the step size of the first self-paced learning layer to 0.1 and gradually increase it in the follow-up passes. To simplify the calculation, we set the number of layers to 3 in our method.
(3) The Solution of the Misclassification Weights. These weights can be calculated by (1); in our work, we apply the sigmoid function to assign the values of the misclassification weight matrix. Using different self-paced weight functions in (10), we can obtain different models. In detail, we adopt three self-paced learning functions, binary, linear, and logarithmic, formulated as follows.
(a) Binary
(b) Linear
(c) Logarithmic

The solving algorithm of our model is shown in Algorithm 1.
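Since the formulas for the three schemes did not survive in the text above, the following sketch gives the closed-form optimal weights commonly used for these schemes in the SPL literature (binary and linear are standard; the logarithmic form follows Jiang et al.'s self-paced curriculum learning and assumes a pace parameter between 0 and 1). These are stand-ins, not necessarily the paper's exact definitions.

```python
import numpy as np

def spl_weight(loss, lam, scheme="binary"):
    """Closed-form self-paced weights v*(loss; lam) for the three schemes.

    binary: v = 1 if loss < lam else 0
    linear: v = max(0, 1 - loss / lam)
    log:    v = log(loss + zeta) / log(zeta) for loss < lam, zeta = 1 - lam
            (requires 0 < lam < 1)
    """
    loss = np.asarray(loss, dtype=float)
    if scheme == "binary":
        return (loss < lam).astype(float)
    if scheme == "linear":
        return np.maximum(0.0, 1.0 - loss / lam)
    if scheme == "log":
        zeta = 1.0 - lam
        v = np.log(loss + zeta) / np.log(zeta)
        return np.where(loss < lam, v, 0.0)
    raise ValueError(f"unknown scheme: {scheme}")
```

All three satisfy the properties listed in Section 3.1: the weight is 1 at zero loss, decreases with the loss, and vanishes once the loss exceeds the current pace threshold; they differ only in how softly the weight decays.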

4. Experiments
4.1. Settings
To evaluate the effectiveness of our proposed method, we conduct experiments on ten binary classification datasets from the UCI repository and on the Alzheimer's Disease Neuroimaging Initiative (ADNI) database. Detailed information on the UCI datasets is presented in Table 1. The AD data used in our experiments were obtained from the ADNI database (adni.loni.usc.edu). In our work, the Alzheimer's Disease (AD) data comprise 913 samples with 116 dimensions, consisting of 160 AD patients, 542 MCI patients, and 211 healthy controls (HC). The MCI patients can be divided into three stages: 82 Significant Memory Concern (SMC) patients, 273 Early Mild Cognitive Impairment (EMCI) patients, and 187 Late Mild Cognitive Impairment (LMCI) patients. The ADNI data contain five modalities: ID (serial number), single nucleotide polymorphism (SNP) data, voxel-based morphometry (VBM), fluorodeoxyglucose positron emission tomography (FDG), and F18-florbetapir PET amyloid imaging (AV45). On the ADNI database, we perform three classification tasks: AD versus HC, MCI versus HC, and SMC versus LMCI. In each classification task, we compare our method with the baselines: SVM with an RBF kernel, AdaBoost, and conventional self-paced methods. In its sample processing, AdaBoost adjusts the distribution of training samples based on the performance of the base learners, so that the samples misdirected in the current iteration receive more attention in the next iteration.

For the conventional self-paced learning models whose weight functions are (13), (14), and (15), we write binary, linear, and log for short. Meanwhile, we construct two additional self-paced learning models based on our proposed model: if the misclassification weights are not considered in the retrospective model, we call it EasySPL for short; when our proposed model is a single-level self-paced learner containing the misclassification weights, we call it SingleSPL. To obtain unbiased results, we adopt a 10-fold cross-validation strategy with four measurements: classification accuracy (ACC), sensitivity (SEN), specificity (SPE), and area under the receiver operating characteristic curve (AUC). On the UCI databases, we repeat all experiments 30 times with 2-fold cross-validation. In the experiments, we analyze the results of each layer of the self-paced learning process to determine the number of layers: we stop training when the converged results of the current self-paced learning pass are not significantly better than those of the previous pass, which determines the number of layers in our model.
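The four measurements used above can be computed from a confusion matrix and a rank statistic; the following helper shows one way to do so for labels in {0, 1} (the function name and signature are ours, for illustration only).

```python
import numpy as np

def binary_metrics(y_true, y_pred, y_score):
    """ACC, SEN, SPE, and AUC for a binary task with labels in {0, 1}.

    SEN is the recall on positives, SPE the recall on negatives, and
    AUC is computed rank-wise via the Mann-Whitney U statistic.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    acc = (tp + tn) / len(y_true)
    sen = tp / (tp + fn)              # sensitivity: true positive rate
    spe = tn / (tn + fp)              # specificity: true negative rate
    scores = np.asarray(y_score, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Fraction of (pos, neg) pairs ranked correctly, ties counted as 1/2
    auc = (np.mean(pos[:, None] > neg[None, :])
           + 0.5 * np.mean(pos[:, None] == neg[None, :]))
    return acc, sen, spe, auc
```

Averaging these four values over the 10 cross-validation folds (and, for UCI, over the 30 repetitions) gives the numbers reported in the tables.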
4.2. Experimental Results on UCI and ADNI Data
First, we verify the effectiveness of introducing tough samples and retrospective self-paced learning into the model on the ten UCI datasets. Figure 3 shows the results of each layer of the proposed self-paced learning method. Clearly, after introducing the misclassification weight matrix, the model behaves more discriminatively in each iteration. Meanwhile, the model of the last self-paced learning pass not only performs better but also behaves more robustly than the previous layer. Table 2 lists the ACC and AUC of the seven baselines and our model: our proposed model achieves the best results on all 10 UCI datasets, with large improvements on several of them.

We compared the proposed method, R2SPL, with several representative classification methods, including SVM, AdaBoost, SPL with the binary, linear, or log function, SingleSPL (SinL), and SPL without tough samples (NSPL). The results are shown in Figure 4. We draw the precision-recall curves of these methods in Figure 5 and present the AUC and ACC results in Table 2. As seen from Figures 4 and 5 and Table 2, our method outperforms the comparison methods on all ten datasets.
We also ran our method and the comparison methods on the ADNI dataset with three classification tasks: AD versus HC, MCI versus HC, and SMC versus LMCI. The comparison results for these three tasks are listed in Tables 3, 4, and 5, respectively. Our proposed method performs better than the other methods in ACC, SEN, SPE, and AUC, which demonstrates its superiority over the other classifiers on AD classification problems.



4.3. Parameter Influence
Our model has two parameters: the regularization term λ and the sparse term p. We test the influence of these two parameters on the 10 UCI datasets. The parameter λ is tuned over a wide range and the sparse term p is adjusted from 0 to 2; when probing the sensitivity of one parameter, we vary only that parameter and fix the other. Figure 6 shows the experimental results. When λ varies over its range, the performance of our proposed model is stable in most cases; when p changes from 0.4 to 2, the performance of our proposed method changes only slightly. Multiple groups of experiments on the 10 UCI datasets verify that our model is not sensitive to the specific parameter values and depends mainly on the structure of the model.
4.4. The Convergence Results of Different Layers of Model
In the convergence analysis, we find that different models have different rates of convergence. The convergence results are shown in Figure 7: as the number of layers increases, the convergence speed of the model also accelerates. Benefiting from the prior knowledge of the previous pass, the current model can reach a local optimal solution faster.
5. Conclusion
In this paper, we divide the samples into easy-learning samples, hard-learning samples, and misclassified samples and analyze their roles in learning. We then introduce tough or misclassified samples into the training of each iteration of self-paced learning. Meanwhile, reflecting the human cognition process, in which people usually need to repeatedly explore and learn from the same data or task over multiple learning stages to gain deep knowledge of it, we design a retrospective framework to improve the robustness of self-paced learning, which uses the model in the previous layer to reduce the negative effect of the small sample size problem in the initialization phase of the next pass. The experimental results show that the proposed method behaves more robustly and discriminatively than conventional self-paced learning methods and many representative methods. In future work, we will extend this framework to other learning tasks, such as semi-supervised learning.
Data Availability
Raw data were generated at Nanjing University of Aeronautics and Astronautics. Derived data supporting the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was supported in part by National Natural Science Foundation of China (nos. 61501230, 61732006, 61876082, and 61861130366), National Science and Technology Major Project (no. 2018ZX10201002), and the Fundamental Research Funds for the Central Universities (no. NP2018104).
References
[1] Y. Bengio, J. Louradour, R. Collobert, and J. Weston, "Curriculum learning," in Proceedings of the 26th International Conference on Machine Learning (ICML 2009), pp. 41–48, Canada, June 2009.
[2] M. P. Kumar, B. Packer, and D. Koller, "Self-paced learning for latent variable models," in Advances in Neural Information Processing Systems, pp. 1189–1197, 2010.
[3] L. Jiang, D. Meng, Q. Zhao, S. Shan, and A. G. Hauptmann, "Self-paced curriculum learning," in Proceedings of the AAAI, vol. 2, p. 6, 2015.
[4] Q. Zhao, D. Meng, L. Jiang, Q. Xie, Z. Xu, and A. G. Hauptmann, "Self-paced learning for matrix factorization," in Proceedings of the 29th AAAI Conference on Artificial Intelligence (AAAI 2015), pp. 3196–3202, 2015.
[5] J. S. Supancic III and D. Ramanan, "Self-paced learning for long-term tracking," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2013), pp. 2379–2386, June 2013.
[6] C. Ying, M. Qi-Guang, L. Jia-Chen, and G. Lin, "Advance and prospects of AdaBoost algorithm," Acta Automatica Sinica, vol. 39, no. 6, pp. 745–758, 2013.
[7] X. Li, L. Wang, and E. Sung, "AdaBoost with SVM-based component classifiers," Engineering Applications of Artificial Intelligence, vol. 21, no. 5, pp. 785–795, 2008.
[8] G. Tur, D. Hakkani-Tür, and R. E. Schapire, "Combining active and semi-supervised learning for spoken language understanding," Speech Communication, vol. 45, no. 2, pp. 171–186, 2005.
[9] S. Huang, R. Jin, and Z.-H. Zhou, "Active learning by querying informative and representative examples," in Advances in Neural Information Processing Systems, pp. 892–900, 2010.
[10] S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–1359, 2010.
[11] V. Mnih, K. Kavukcuoglu, D. Silver et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[12] Z. Zhang, Y. Xu, J. Yang, X. Li, and D. Zhang, "A survey of sparse representation: algorithms and applications," IEEE Access, vol. 3, pp. 490–530, 2015.
[13] D.-C. Li, C.-S. Wu, T.-I. Tsai, and Y.-S. Lina, "Using mega-trend-diffusion and artificial samples in small data set learning for early flexible manufacturing system scheduling knowledge," Computers & Operations Research, vol. 34, no. 4, pp. 966–982, 2007.
[14] L. Jiang, D. Meng, T. Mitamura, and A. G. Hauptmann, "Easy samples first: self-paced reranking for zero-example multimedia search," in Proceedings of the 2014 ACM Conference on Multimedia (MM 2014), pp. 547–556, USA, November 2014.
[15] S. Zhou, J. Wang, D. Meng et al., "Deep self-paced learning for person re-identification," Pattern Recognition, vol. 76, pp. 739–751, 2018.
[16] Y. Fan, R. He, J. Liang, and B.-G. Hu, "Self-paced learning: an implicit regularization perspective," in Proceedings of the AAAI, vol. 3, p. 4, 2017.
[17] T. Hastie, S. Rosset, J. Zhu, and H. Zou, "Multi-class AdaBoost," Statistics and Its Interface, vol. 2, no. 3, pp. 349–360, 2009.
[18] F. Lv and R. Nevatia, "Recognition and segmentation of 3-D human action using HMM and multi-class AdaBoost," in Proceedings of the European Conference on Computer Vision, pp. 359–372, Springer, 2006.
[19] P. Viola and M. Jones, "Fast and robust classification using asymmetric AdaBoost and a detector cascade," in Advances in Neural Information Processing Systems, pp. 1311–1318, 2002.
[20] G. Rätsch, T. Onoda, and K. R. Müller, "Soft margins for AdaBoost," Machine Learning, vol. 42, no. 3, pp. 287–320, 2001.
[21] J. H. Morra, Z. Tu, L. G. Apostolova, A. E. Green, A. W. Toga, and P. M. Thompson, "Comparison of AdaBoost and support vector machines for detecting Alzheimer's disease through automated hippocampal segmentation," IEEE Transactions on Medical Imaging, vol. 29, no. 1, pp. 30–43, 2010.
[22] L. Lin, K. Wang, D. Meng, W. Zuo, and L. Zhang, "Active self-paced learning for cost-effective and progressive face identification," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 1, pp. 7–19, 2018.
[23] S. Tong and D. Koller, "Support vector machine active learning with applications to text classification," Journal of Machine Learning Research, vol. 2, pp. 45–66, 2001.
[24] G. Schohn and D. Cohn, "Less is more: active learning with support vector machines," in Proceedings of the ICML, pp. 839–846, 2000.
[25] S. Tong and E. Chang, "Support vector machine active learning for image retrieval," in Proceedings of the 9th ACM International Conference on Multimedia, pp. 107–118, October 2001.
[26] P. Mitra, B. U. Shankar, and S. K. Pal, "Segmentation of multispectral remote sensing images using active support vector machines," Pattern Recognition Letters, vol. 25, no. 9, pp. 1067–1074, 2004.
[27] A. L. Brown and M. J. Kane, "Preschool children can learn to transfer: learning to learn and learning from example," Cognitive Psychology, vol. 20, no. 4, pp. 493–523, 1988.
[28] M. E. Taylor and P. Stone, "Transfer learning for reinforcement learning domains: a survey," Journal of Machine Learning Research, vol. 10, pp. 1633–1685, 2009.
[29] H. Shin, H. R. Roth, M. Gao et al., "Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning," IEEE Transactions on Medical Imaging, vol. 35, no. 5, pp. 1285–1298, 2016.
[30] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, "Show and tell: a neural image caption generator," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), pp. 3156–3164, June 2015.
Copyright
Copyright © 2019 Qi Zhu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.