Abstract

Owing to its wide applicability and learning efficiency, online ordinal regression using perceptron algorithms with interval labels (PRIL) has been increasingly applied to ordinal ranking problems. However, handling noisy labels remains a challenge for the PRIL method, as the ranking results may change dramatically in their presence. To tackle this problem, in this paper we propose a noise-resilient online learning algorithm using the ramp loss function, called PRIL-RAMP, together with its nonlinear variant K-PRIL-RAMP, to improve the performance of the PRIL method on noisy data streams. The proposed algorithms iteratively optimize the decision function under the framework of online gradient descent (OGD), and we justify the algorithms by showing the order preservation of thresholds. Experiments on real-world datasets validate that both approaches are more robust to noisy labels and more efficient than state-of-the-art online ordinal regression algorithms.

1. Introduction

Ordinal regression, also called ranking learning, plays a central role in learning tasks where the labels of data samples need to be ordered. It has been routinely used in social ranking tasks, e.g., collaborative filtering [1], ecology [2], and detecting the severity of Alzheimer's disease [3], to name a few. In contrast to other types of regression analysis, ordinal regression describes relationships between variables where the order matters. For example, age categories can be sorted as (0–9, 10–19, …, 90–99), and the levels of bond credit ratings can be sorted as "B" ≺ "A" ≺ "AA" ≺ "AAA" [4]. In general, ordinal regression requires a linear (or nonlinear) function and a set of thresholds, where each threshold corresponds to a class. Ordinal regression also differs from multiclass classification in that the output labels have a natural order.

A seminal work on ordinal regression can be traced back to the proportional odds model (POM) proposed by McCullagh [5], which studied a general class of regression models for ordinal data. Later, large-margin formulations for ordinal regression were proposed in [6, 7]; both approaches outperformed existing benchmarks when applied to ranking and multiclass classification. Furthermore, to accelerate training, Gu et al. [8] proposed asynchronous parallel coordinate descent algorithms. Besides, Berg et al. [9] showed that training a classifier can improve neural network accuracy and proposed that using several discrete data representations simultaneously improves neural network learning compared to a single representation. In many situations, interval labels have to be used instead of exact labels. For example, when predicting product ratings, we may obtain entire ranges of scores (e.g., 1–3 and 4–7) from different customers. In [10], a large-margin batch algorithm using interval labels was proposed. The approaches discussed so far are batch algorithms. Recently, in order to reduce computational time, Crammer et al. [11] proposed an online learning algorithm, the passive aggressive (PA) algorithm, for ordinal regression, which chooses an appropriate step size to find the new parameters in every trial. Later, Manwani and Chandra [12] employed the PA algorithm to solve problems with interval labels. The perceptron is another principled method for learning classifiers in an online fashion. The authors in [13, 14] proposed an online algorithm (with principles similar to the classic perceptron used for two-class separation) for finding the set of parallel hyperplanes that comply with the separation rule. In [15], a perceptron-based approach was proposed for ordinal regression using interval labels.

Although existing works have made considerable progress in reducing memory requirements and computational time, training models on datasets of poor quality, e.g., with many noisy labels, remains challenging. Existing classification algorithms predominantly focus on static data and are designed to learn discriminant information from clean data. They, however, suffer from two major shortcomings: noisy labels may negatively affect the learning procedure, and learning from scratch may incur a huge computational burden. The authors in [16] proposed a noise-resilient online classification algorithm, which is scalable and robust to noisy labels, and applied it to peptide identification [17]. To reduce the negative influence of outliers, the authors in [18] proposed a more robust algorithm termed ramp loss twin K-class support vector classification (Ramp-TKSVC), in which the ramp loss function substitutes for the hinge loss function.

In this work, we aim to develop noise-resilient online learning algorithms for the online ordinal regression problem on noisy data streams. In particular, we propose a perceptron ranking algorithm using interval labels with ramp loss (PRIL-RAMP) and its kernel variant (K-PRIL-RAMP) for ordinal regression in an online learning manner. Our key contributions are as follows:
(1) We propose linear and nonlinear noise-resilient online algorithms, i.e., PRIL-RAMP and K-PRIL-RAMP, and design a procedure to update the model parameters. The optimal parameters are obtained by an online gradient descent (OGD) procedure.
(2) We theoretically prove that the proposed PRIL-RAMP algorithm preserves the ordering of the thresholds.
(3) Experimental studies on various datasets demonstrate the effectiveness of the proposed algorithms in comparison with state-of-the-art algorithms.

The paper is organized as follows. In Section 2, we discuss a generic framework of ordinal regression using interval labels, introduce the proposed PRIL-RAMP and K-PRIL-RAMP algorithms, and discuss the order preservation of their thresholds. In Section 3, we present experiments comparing PRIL-RAMP with state-of-the-art algorithms. We conclude in Section 4.

2. Method

2.1. Learning to Rank in Online Ordinal Regression

Ordinal regression has been successfully applied to a great number of real-world problems. Typical ordinal regression trains models in a batch manner, i.e., feeding all data into the training model at once [6–8]. The batch training manner demands huge computation and memory for large-scale problems, and it is also not adaptable to streaming data. Compared with the traditional batch learning framework, online learning (shown in Figure 1) processes samples in a streaming fashion, so it scales well and responds in real time. We first introduce online ordinal regression with the perceptron algorithm and its variant, and then present the noise-resilient online learning algorithms PRIL-RAMP and K-PRIL-RAMP in the next section.

Let $\mathcal{X} \subseteq \mathbb{R}^d$ be the instance space, $\mathcal{Y} = \{1, 2, \ldots, K\}$ the label space, and $\{(\mathbf{x}^t, y_l^t, y_r^t)\}_{t=1}^{T}$ a sequence of instance–rank pairs, where $\mathbf{x}^t \in \mathcal{X}$ and $[y_l^t, y_r^t]$ is its corresponding rank interval, $1 \le y_l^t \le y_r^t \le K$. We endow $\mathcal{Y}$ with "$\preceq$" as the order relation. Every instance $\mathbf{x}^t$ is assigned an interval label $[y_l^t, y_r^t]$; in particular, it becomes the exact-label scenario when $y_l^t = y_r^t$. Let $S = \{(\mathbf{x}^t, y_l^t, y_r^t)\}_{t=1}^{T}$ be the training set. In ordinal regression, our objective is to learn the model $h$, which predicts the rank as the index of the smallest threshold exceeding the score:

$$h(\mathbf{x}) = \min\{ j \in \{1, \ldots, K\} : f(\mathbf{x}) - \theta_j < 0 \},$$

where the function $f : \mathcal{X} \to \mathbb{R}$ is defined as $f(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x}$, and the thresholds are $\theta_1 \le \theta_2 \le \cdots \le \theta_{K-1}$, with $\theta_K = \infty$.
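To make the prediction rule concrete, here is a minimal Python sketch (our own illustration, not code from [13]; the names predict_rank, w, and theta are ours):

```python
import numpy as np

def predict_rank(w, theta, x):
    """Predict the rank of x as the smallest j with w.x - theta_j < 0.

    w     : weight vector of the linear score function f(x) = w . x
    theta : array of K-1 non-decreasing thresholds; theta_K = inf is implicit
    """
    score = np.dot(w, x)
    for j, t in enumerate(theta, start=1):
        if score - t < 0:
            return j
    return len(theta) + 1  # score exceeds every threshold: rank K

# Toy usage with K = 4 ranks (3 thresholds)
w = np.array([1.0, -0.5])
theta = np.array([-1.0, 0.0, 1.5])
print(predict_rank(w, theta, np.array([0.4, 0.2])))  # score 0.3 -> rank 3
```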

The lower interval endpoint can be defined similarly. Let $\Theta = \{\boldsymbol{\theta} \in \mathbb{R}^{K-1} : \theta_1 \le \theta_2 \le \cdots \le \theta_{K-1}\}$ denote the implicit constraints for the ordering of thresholds, and let $[y_l, y_r]$ stand for the interval label [8]. Then, we have the interval-insensitive loss function defined as follows [13]:

$$L(f(\mathbf{x}), y_l, y_r) = \sum_{j=1}^{y_l - 1} \max\big(0, 1 - (f(\mathbf{x}) - \theta_j)\big) + \sum_{j=y_r}^{K-1} \max\big(0, 1 + (f(\mathbf{x}) - \theta_j)\big).$$

As the loss function is convex in $(\mathbf{w}, \boldsymbol{\theta})$ for each $j$, the sum over $j$ is convex; similarly, the sum over all samples is also convex. Note that the loss tends to $\infty$ if there is a large amount of noise. This loss function is primarily designed to learn from clean data, and the overall performance of the learning model degrades under the interference of label noise.
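For reference, a short sketch of this interval-insensitive loss under the reconstruction above (the unit margin and the summation ranges follow our reading of [13] and should be treated as assumptions):

```python
import numpy as np

def interval_insensitive_loss(w, theta, x, y_l, y_r):
    """Hinge penalties at thresholds outside the interval label [y_l, y_r].

    Thresholds j < y_l should satisfy f(x) - theta_j >= 1, and
    thresholds j >= y_r should satisfy theta_j - f(x) >= 1.
    theta is 0-indexed, so theta[j - 1] stores theta_j.
    """
    f = np.dot(w, x)
    loss = 0.0
    for j in range(1, y_l):                  # j = 1, ..., y_l - 1
        loss += max(0.0, 1.0 - (f - theta[j - 1]))
    for j in range(y_r, len(theta) + 1):     # j = y_r, ..., K - 1
        loss += max(0.0, 1.0 + (f - theta[j - 1]))
    return loss
```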

2.2. Perceptron Ranking Using Interval Label Algorithms with Ramp Loss

We begin the presentation of the noise-resilient online learning algorithms with the PRIL algorithm and its variants.

2.2.1. PRIL Algorithm

Perceptron ranking using interval labels (PRIL) was first introduced in [13] for ordinal regression in an online fashion. Let $(\mathbf{x}^t, y_l^t, y_r^t)$ be the observed example, and let $\mathbf{w}^t$ and $\theta_1^t, \ldots, \theta_{K-1}^t$ be the parameters of the online ordinal regression model at time $t$. Let $E^t = \{1, \ldots, y_l^t - 1\} \cup \{y_r^t, \ldots, K-1\}$ be the set of thresholds lying outside the interval label. Then, we define

$$z_j^t = \mathbb{I}[j \le y_l^t - 1] - \mathbb{I}[j \ge y_r^t], \quad j \in E^t,$$

where $\mathbb{I}[A]$ is an indicator function which takes value 1 if $A$ is true and 0 otherwise. Thus $z_j^t = +1$ for $j \le y_l^t - 1$ and $z_j^t = -1$ for $j \ge y_r^t$. Let $u_j^t = z_j^t(\mathbf{w}^t \cdot \mathbf{x}^t - \theta_j^t)$. For a given sample–interval pair, the loss at threshold $j$ vanishes only when $u_j^t > 0$.

Therefore, a correctly ranked example requires that $z_j^t(\mathbf{w}^t \cdot \mathbf{x}^t - \theta_j^t) > 0$ for all $j \in E^t$. The perceptron ranking loss function in the linear situation can be rewritten as

$$L_P(\mathbf{w}, \boldsymbol{\theta}; \mathbf{x}^t, y_l^t, y_r^t) = \sum_{j \in E^t} \max\big(0, -z_j^t(\mathbf{w} \cdot \mathbf{x}^t - \theta_j)\big).$$
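The per-threshold loss terms can be computed directly from this definition; the following sketch (our own, with 0-indexed theta) mirrors the formula:

```python
import numpy as np

def perceptron_ranking_loss(w, theta, x, y_l, y_r):
    """Sum of max(0, -z_j * (w.x - theta_j)) over thresholds j in E^t."""
    f = np.dot(w, x)
    loss = 0.0
    for j in range(1, len(theta) + 1):
        if j <= y_l - 1:
            z = +1.0                 # score should exceed theta_j
        elif j >= y_r:
            z = -1.0                 # score should stay below theta_j
        else:
            continue                 # thresholds inside the interval: no loss
        loss += max(0.0, -z * (f - theta[j - 1]))
    return loss
```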

2.2.2. PRIL-RAMP and K-PRIL-RAMP Algorithms

Now, we present the PRIL-RAMP procedure, which minimizes the estimated risk of the ramp loss in online ordinal regression using the OGD framework [14].

(1) Ramp Loss. To offset the influence of noisy data, we apply the ramp loss as a surrogate of the perceptron loss function. For the margin variable $u = z(\mathbf{w} \cdot \mathbf{x} - \theta)$, the ramp loss is defined as

$$L_R(u) = \min\big(\epsilon, \max(0, -u)\big),$$

where $\epsilon > 0$ is the ramp parameter.

Notice that the ramp loss is a noise-resilient function: it caps the contribution of a noisy sample at $\epsilon$, so the gradient contribution of such a sample becomes zero as $u$ tends to $-\infty$. In classification methodologies, robustness to noise is always an important issue. The effect of noisy samples can be significant for the perceptron loss because, like any convex loss, it is unbounded and assigns huge penalties to outliers. The ramp loss, on the other hand, has an upper bound and hence can control and largely remove the effect of noisy samples. Plots of the perceptron loss and ramp loss in Figure 2 illustrate the robustness (noise resilience) of the ramp loss.

In addition, the ramp loss is a difference-of-convex (DC) function and can be represented as the difference of two convex functions:

$$L_R(u) = L_P(u) - L_P(u + \epsilon),$$

where $L_P(u) = \max(0, -u)$ is the perceptron loss function, as illustrated in Figure 3.
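This decomposition is easy to verify numerically; a short check (our own sketch, with eps standing for the ramp parameter $\epsilon$):

```python
import numpy as np

def perceptron_loss(u):
    return np.maximum(0.0, -u)

def ramp_loss(u, eps):
    # Clipped perceptron loss: equals perceptron_loss(u) - perceptron_loss(u + eps)
    return np.minimum(eps, np.maximum(0.0, -u))

u = np.linspace(-5.0, 5.0, 101)
eps = 2.0
assert np.allclose(ramp_loss(u, eps),
                   perceptron_loss(u) - perceptron_loss(u + eps))
```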

(2) PRIL-RAMP Algorithm. In the setting of ordinal regression, instead of letting every misranked category contribute an unbounded error to its threshold, we let the ramp loss function govern the threshold updates. The ordinal inequalities on the thresholds are satisfied automatically at the optimal solution.

Figure 4 illustrates the update rule of the PRIL-RAMP algorithm with an exact-label example whose correct rank is $K$. Note that $\theta_K = \infty$ is omitted from all the plots in Figure 4. Since the correct rank is $K$, the value of $\mathbf{w} \cdot \mathbf{x}$ should fall in the last interval, above $\theta_{K-1}$. However, as can be seen in Figure 4, the value of $\mathbf{w} \cdot \mathbf{x}$ falls below that interval, and a lower rank is predicted. Every threshold whose value is higher than $\mathbf{w} \cdot \mathbf{x}$ (but within the ramp region) is in error. To mend the mistake, the algorithm decreases each such threshold by a unit and updates $\mathbf{w}$ so that the inner product increases. This update is illustrated in the middle plot of Figure 4, and the updated prediction rule is sketched on the right-hand side. Note that after the update, the predicted rank of $\mathbf{x}$ is closer to the true rank; we allow a small error of one unit. Note also that the data point with the interval label shown in red is noise. By replacing the perceptron loss with the ramp loss, PRIL-RAMP guarantees that the parameters are not updated by such points whose scores violate the thresholds by more than $\epsilon$, which significantly controls the effect of noisy data.

Next, we describe the procedure in PRIL-RAMP to update the parameters. At trial $t$, the instantaneous risk is

$$L_R(\mathbf{w}, \boldsymbol{\theta}; \mathbf{x}^t, y_l^t, y_r^t) = \sum_{j \in E^t} \min\big(\epsilon, \max(0, -u_j)\big),$$

where $u_j = z_j^t(\mathbf{w} \cdot \mathbf{x}^t - \theta_j)$.

Note that $L_P(u_j)$ is a convex function, while $-L_P(u_j + \epsilon)$ is a concave function. Let $L_{\mathrm{cav}}(u_j) = -L_P(u_j + \epsilon)$; then we have

$$L_R(\mathbf{w}, \boldsymbol{\theta}; \mathbf{x}^t, y_l^t, y_r^t) = \sum_{j \in E^t} \big[L_P(u_j) + L_{\mathrm{cav}}(u_j)\big].$$

The concave–convex procedure (CCCP) [19] can be used to obtain the optimal solution of such a DC function. However, as mentioned above, this batch learning manner cannot meet the real-time requirement when dealing with streaming data. In this work, we instead use the OGD method to find a near-optimal solution, trading accuracy for scalability. Particularly, we have

$$\partial_{\mathbf{w}} L_R = -\sum_{j \in E^t} z_j^t\, \mathbb{I}\big[-\epsilon < u_j^t \le 0\big]\, \mathbf{x}^t, \qquad \partial_{\theta_j} L_R = z_j^t\, \mathbb{I}\big[-\epsilon < u_j^t \le 0\big].$$

To estimate $\mathbf{w}$ and $\boldsymbol{\theta}$, we initialize $\mathbf{w}^1 = \mathbf{0}$ and $\theta_j^1 = 0$ for all $j$. Let $\mathbf{w}^t$ and $\boldsymbol{\theta}^t$ be the estimates of the parameters at the beginning of trial $t$, $\mathbf{x}^t$ be the example observed, and $[y_l^t, y_r^t]$ be its label interval. $\mathbf{w}^{t+1}$ and $\boldsymbol{\theta}^{t+1}$ are estimated by a unit-step subgradient descent on the instantaneous risk:

$$\mathbf{w}^{t+1} = \mathbf{w}^t - \partial_{\mathbf{w}} L_R\big|_{(\mathbf{w}^t, \boldsymbol{\theta}^t)}, \qquad \theta_j^{t+1} = \theta_j^t - \partial_{\theta_j} L_R\big|_{(\mathbf{w}^t, \boldsymbol{\theta}^t)}.$$

Let $\tau_j^t = z_j^t\, \mathbb{I}\big[-\epsilon < z_j^t(\mathbf{w}^t \cdot \mathbf{x}^t - \theta_j^t) \le 0\big]$ for $j \in E^t$ and $\tau_j^t = 0$ otherwise; then we have

$$\mathbf{w}^{t+1} = \mathbf{w}^t + \Big(\sum_{j=1}^{K-1} \tau_j^t\Big) \mathbf{x}^t, \qquad \theta_j^{t+1} = \theta_j^t - \tau_j^t, \quad j = 1, \ldots, K-1.$$

Note that the sets $E^t$ and $\{z_j^t\}$ are known at every trial $t$. The complete description of the PRIL-RAMP algorithm is given in Algorithm 1.

Input: Training set $S = \{(\mathbf{x}^t, y_l^t, y_r^t)\}_{t=1}^{T}$, ramp parameter $\epsilon$
Initialize: Set $\mathbf{w}^1 = \mathbf{0}$, $\theta_j^1 = 0$ for $j = 1, \ldots, K-1$, and $\theta_K = \infty$
Output: Predict $\hat{y}^t = \min\{j : \mathbf{w}^t \cdot \mathbf{x}^t - \theta_j^t < 0\}$
(1) for $t = 1, \ldots, T$ do
(2)  Receive instance $\mathbf{x}^t$ from $S$
(3)  Predict: $\hat{y}^t = \min\{j : \mathbf{w}^t \cdot \mathbf{x}^t - \theta_j^t < 0\}$
(4)  Observe: $[y_l^t, y_r^t]$
(5)  Set $E^t = \{1, \ldots, y_l^t - 1\} \cup \{y_r^t, \ldots, K-1\}$
(6)  Set $z_j^t = \mathbb{I}[j \le y_l^t - 1] - \mathbb{I}[j \ge y_r^t]$ for $j \in E^t$
(7)  Initialize $\tau_j^t = 0$ for $j = 1, \ldots, K-1$
(8)  for $j \in E^t$ do
(9)   if $-\epsilon < z_j^t(\mathbf{w}^t \cdot \mathbf{x}^t - \theta_j^t) \le 0$ then
(10)   $\tau_j^t = z_j^t$
(11)  end if
(12) end for
(13) $\mathbf{w}^{t+1} = \mathbf{w}^t + \big(\sum_{j=1}^{K-1} \tau_j^t\big)\, \mathbf{x}^t$
(14) $\theta_j^{t+1} = \theta_j^t - \tau_j^t$ for $j = 1, \ldots, K-1$
(15) end for
(16) return $\mathbf{w}^{T+1}$, $\boldsymbol{\theta}^{T+1}$
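For concreteness, one trial of Algorithm 1 can be written in Python as follows; this is a minimal sketch under our reconstruction above (unit step size, ramp parameter eps), not the authors' reference implementation:

```python
import numpy as np

def pril_ramp_step(w, theta, x, y_l, y_r, eps):
    """One PRIL-RAMP trial: compute per-threshold updates and apply them.

    w     : (d,) weight vector
    theta : (K-1,) thresholds, kept non-decreasing (Theorem 1)
    eps   : ramp parameter; violations deeper than eps are treated as noise
    """
    score = w @ x
    K = len(theta) + 1
    tau = np.zeros(K - 1)
    for j in range(1, K):                 # thresholds 1, ..., K-1
        if j <= y_l - 1:
            z = +1.0
        elif j >= y_r:
            z = -1.0
        else:
            continue                      # inside the interval: no update
        u = z * (score - theta[j - 1])
        if -eps < u <= 0:                 # active, un-clipped ramp region
            tau[j - 1] = z                # u <= -eps: discarded as noise
    w = w + tau.sum() * x
    theta = theta - tau
    return w, theta
```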

(3) K-PRIL-RAMP Algorithm. For nonlinear ordinal regression, we choose a suitable kernel function as proposed by Manwani [15]. That is, the score function is represented as

$$f^t(\mathbf{x}) = \sum_{i=1}^{t-1} c^i\, k(\mathbf{x}^i, \mathbf{x}),$$

where $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a kernel function and $c^i = \sum_{j=1}^{K-1} \tau_j^i$ is the coefficient accumulated for instance $\mathbf{x}^i$.

The kernel PRIL algorithm with ramp loss (K-PRIL-RAMP) is described in Algorithm 2.

Input: Training set $S = \{(\mathbf{x}^t, y_l^t, y_r^t)\}_{t=1}^{T}$, kernel $k(\cdot, \cdot)$, ramp parameter $\epsilon$
Initialize: Set $c^i = 0$ for all $i$, $\theta_j^1 = 0$ for $j = 1, \ldots, K-1$, and $\theta_K = \infty$
Output: Predict $\hat{y}^t = \min\{j : f^t(\mathbf{x}^t) - \theta_j^t < 0\}$
(1) for $t = 1, \ldots, T$ do
(2)  Receive instance $\mathbf{x}^t$ from $S$
(3)  Predict: $\hat{y}^t = \min\{j : f^t(\mathbf{x}^t) - \theta_j^t < 0\}$, where $f^t(\mathbf{x}^t) = \sum_{i=1}^{t-1} c^i k(\mathbf{x}^i, \mathbf{x}^t)$
(4)  Observe: $[y_l^t, y_r^t]$
(5)  Set $E^t = \{1, \ldots, y_l^t - 1\} \cup \{y_r^t, \ldots, K-1\}$
(6)  Set $z_j^t = \mathbb{I}[j \le y_l^t - 1] - \mathbb{I}[j \ge y_r^t]$ for $j \in E^t$
(7)  Initialize $\tau_j^t = 0$ for $j = 1, \ldots, K-1$
(8)  for $j \in E^t$ do
(9)   if $-\epsilon < z_j^t(f^t(\mathbf{x}^t) - \theta_j^t) \le 0$ then
(10)   $\tau_j^t = z_j^t$
(11)  end if
(12) end for
(13) $c^t = \sum_{j=1}^{K-1} \tau_j^t$
(14) $\theta_j^{t+1} = \theta_j^t - \tau_j^t$ for $j = 1, \ldots, K-1$
(15) end for
(16) return $\{c^i\}_{i=1}^{T}$, $\boldsymbol{\theta}^{T+1}$
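A corresponding sketch of one K-PRIL-RAMP trial (our own illustration; the Gaussian kernel and its width are assumptions, since the kernels are tuned per dataset following [15]):

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    # Assumed Gaussian (RBF) kernel; [15] selects kernels per dataset.
    return np.exp(-gamma * np.sum((a - b) ** 2))

def kpril_ramp_step(support, coef, theta, x, y_l, y_r, eps, kernel=rbf_kernel):
    """One K-PRIL-RAMP trial with f(x) = sum_i coef[i] * k(support[i], x)."""
    score = sum(c * kernel(s, x) for c, s in zip(coef, support))
    tau = np.zeros(len(theta))
    for j in range(1, len(theta) + 1):
        if j <= y_l - 1:
            z = +1.0
        elif j >= y_r:
            z = -1.0
        else:
            continue
        u = z * (score - theta[j - 1])
        if -eps < u <= 0:
            tau[j - 1] = z
    if tau.any():                 # store x only when it triggers an update
        support.append(x)
        coef.append(tau.sum())
    return support, coef, theta - tau
```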
2.3. Theoretical Analysis of the PRIL-RAMP Algorithm

Considering the properties of PRIL algorithms, we now show that the PRIL-RAMP algorithm inherently maintains the ordering of thresholds at each iteration.

Theorem 1 (order preservation of thresholds in the PRIL-RAMP algorithm). Let $\theta_1^t \le \theta_2^t \le \cdots \le \theta_{K-1}^t$ be the thresholds at trial $t$. Let $\mathbf{x}^t$ be the instance at trial $t$ and $[y_l^t, y_r^t]$ be its corresponding interval label. Let $\theta_1^{t+1}, \ldots, \theta_{K-1}^{t+1}$ be the updated thresholds using PRIL-RAMP. Then, $\theta_1^{t+1} \le \theta_2^{t+1} \le \cdots \le \theta_{K-1}^{t+1}$.

Proof. Recall that $\theta_j^{t+1} = \theta_j^t - \tau_j^t$, with $\tau_j^t \in \{0, +1\}$ for $j \le y_l^t - 1$, $\tau_j^t \in \{0, -1\}$ for $j \ge y_r^t$, and $\tau_j^t = 0$ otherwise; since the thresholds are initialized at zero and updated in unit steps, they remain integer-valued. For each pair of adjacent thresholds, we need to analyse four different cases as follows:

(1) $y_l^t \le j < j + 1 \le y_r^t - 1$: we know that $j, j+1 \notin E^t$. PRIL-RAMP does not update thresholds in this range. Thus, $\theta_j^{t+1} = \theta_j^t \le \theta_{j+1}^t = \theta_{j+1}^{t+1}$.

(2) $j \le y_l^t - 1 < j + 1$ or $j < y_r^t \le j + 1$ (the pair straddles an interval boundary): in this case, $\theta_j$ can only decrease or stay ($\tau_j^t \ge 0$), while $\theta_{j+1}$ can only increase or stay ($\tau_{j+1}^t \le 0$). Thus, we get $\theta_j^{t+1} \le \theta_j^t \le \theta_{j+1}^t \le \theta_{j+1}^{t+1}$.

(3) $j < j + 1 \le y_l^t - 1$: we see that $z_j^t = z_{j+1}^t = +1$, so $\tau_j^t, \tau_{j+1}^t \in \{0, 1\}$. Thus, there can be two cases only. (a) $\tau_{j+1}^t = 0$: we simply get $\theta_{j+1}^{t+1} = \theta_{j+1}^t \ge \theta_j^t \ge \theta_j^{t+1}$. (b) $\tau_{j+1}^t = 1$: since $-\epsilon < \mathbf{w}^t \cdot \mathbf{x}^t - \theta_{j+1}^t \le 0$ and $\theta_j^t \le \theta_{j+1}^t$, we get $\mathbf{w}^t \cdot \mathbf{x}^t - \theta_j^t > -\epsilon$, so $\tau_j^t = 0$ cannot be caused by clipping. That means $\tau_j^t = 0$ is possible only when $\theta_j^t < \mathbf{w}^t \cdot \mathbf{x}^t \le \theta_{j+1}^t$. However, the thresholds are integer-valued, so $\theta_{j+1}^t \ge \theta_j^t + 1$. Thus, $\theta_{j+1}^{t+1} = \theta_{j+1}^t - 1 \ge \theta_j^t = \theta_j^{t+1}$; and if $\tau_j^t = 1$, both thresholds decrease by one unit, preserving the order.

(4) $y_r^t \le j < j + 1$: we see that $z_j^t = z_{j+1}^t = -1$, so $\tau_j^t, \tau_{j+1}^t \in \{0, -1\}$. (a) $\tau_j^t = 0$: we simply get $\theta_j^{t+1} = \theta_j^t \le \theta_{j+1}^t \le \theta_{j+1}^{t+1}$. (b) $\tau_j^t = -1$: since $-\epsilon < \theta_j^t - \mathbf{w}^t \cdot \mathbf{x}^t \le 0$ and $\theta_{j+1}^t \ge \theta_j^t$, we get $\theta_{j+1}^t - \mathbf{w}^t \cdot \mathbf{x}^t > -\epsilon$, so $\tau_{j+1}^t = 0$ is possible only when $\theta_{j+1}^t > \mathbf{w}^t \cdot \mathbf{x}^t \ge \theta_j^t$. However, the thresholds are integer-valued, so $\theta_{j+1}^{t+1} = \theta_{j+1}^t \ge \theta_j^t + 1 = \theta_j^{t+1}$; and if $\tau_{j+1}^t = -1$, both thresholds increase by one unit, preserving the order.

This completes the proof.
A similar argument applies to the K-PRIL-RAMP algorithm, and hence we skip its proof. Theorem 1 shows that the PRIL-RAMP and K-PRIL-RAMP algorithms keep the thresholds in order.
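Theorem 1 is also easy to check empirically; the following harness (our own test, reusing the pril_ramp_step sketch from Section 2.2.2) asserts the threshold ordering over many random trials:

```python
import numpy as np

rng = np.random.default_rng(0)
K, d, eps = 5, 3, 1.0
w, theta = np.zeros(d), np.zeros(K - 1)
for t in range(10000):
    x = rng.normal(size=d)
    y_l = int(rng.integers(1, K + 1))
    y_r = int(rng.integers(y_l, K + 1))
    w, theta = pril_ramp_step(w, theta, x, y_l, y_r, eps)
    assert np.all(np.diff(theta) >= 0), "threshold order violated"
```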

3. Experiments

We evaluate the performance of the PRIL-RAMP algorithm by comparing it with other benchmark methods on datasets with various ratios of noise. All experimental studies are performed in the MATLAB R2016a environment on a PC with a 2.5 GHz Intel Core i5 processor and 8 GB RAM running the Windows 10 operating system.

3.1. Dataset

All datasets are obtained from the UCI machine learning repository (http://archive.ics.uci.edu/ml/) [20] and the LIBSVM website, and data features are normalized to zero mean and unit variance coordinatewise. More details about the datasets are provided as follows:

Abalone: the dataset contains 4177 instances with 8 attributes, related to physical measurements of abalone found in Australia. A typical task on this dataset is to predict the age of an abalone using the "Rings" attribute, which varies from 1 to 29. We divide the target range 1–29 into 4 intervals: 1–7, 8–9, 10–12, and 13–29 [12].

Parkinsons-updrs: the dataset consists of a range of biomedical voice measurements from people with early-stage Parkinson's disease. There are 5847 instances with 21 features. The target variable is "total UPDRS" (the clinician's total UPDRS score), which varies from 7 to 54.992. The target range 7–54.992 is divided into 5 groups with endpoints (17, 17, 27, 37, 54.992) [12].

Real estate valuation: the data are collected from Sindian District, New Taipei City, Taiwan. This dataset has 414 instances with 7 attributes. We focus on predicting the house price per unit area, which ranges from 0 to 200. The 4 intervals created are 0–20, 21–40, 41–60, and 61–200.
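For instance, the Abalone binning above can be reproduced in a few lines (a hypothetical helper of ours, not code from [12]):

```python
import numpy as np

def rings_to_class(rings):
    """Map the Abalone 'Rings' value (1-29) to 4 ordinal classes:
    1-7 -> 1, 8-9 -> 2, 10-12 -> 3, 13-29 -> 4."""
    uppers = [7, 9, 12, 29]                  # upper endpoint of each class
    return int(np.searchsorted(uppers, rings) + 1)

assert [rings_to_class(r) for r in (5, 8, 10, 20)] == [1, 2, 3, 4]
```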

Table 1 shows the characteristics of the datasets, including the number of patterns, attributes, and classes, as well as the number of patterns per class. To show the statistical properties of the different datasets, we draw box plots in Figure 5, where the x-axis denotes the feature dimensions of the examples and the points beyond the whiskers on the y-axis correspond to outliers. The plots indicate that noise exists in the datasets themselves, especially in the Parkinsons-updrs dataset.

3.2. Generating Noise

Noisy examples are generated by randomly choosing instances from the dataset. First, for the covariate variables, the error term follows a mixture normal distribution [21]. A visualization of the noisy datasets is provided in Figure 6, where the red points denote the original examples and the blue circles correspond to the outliers. Then, label noise is generated in the target variables: for each of the chosen examples, we randomly assign an incorrect interval label in place of the interval containing the actual label $y$. Finally, we consider noise ratios of 25% and 50%.
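A sketch of this corruption procedure under our reading of the text (the mixture weights and variances are illustrative assumptions rather than values from [21], and excluding the true rank from the corrupted interval is likewise our assumption):

```python
import numpy as np

rng = np.random.default_rng(42)

def add_noise(X, y, K, ratio):
    """Corrupt a fraction `ratio` of the data: perturb features with a
    two-component Gaussian mixture and replace each chosen label by a
    random interval [y_l, y_r] that excludes the true rank y.

    X must be a float array; assumes K >= 2 so an excluding interval exists."""
    X, y_l, y_r = X.copy(), y.copy(), y.copy()
    idx = rng.choice(len(X), size=int(ratio * len(X)), replace=False)
    for i in idx:
        wide = rng.random() < 0.1                 # assumed mixture weight
        X[i] += rng.normal(0.0, 5.0 if wide else 1.0, size=X.shape[1])
        lo = hi = y[i]
        while lo <= y[i] <= hi:                   # resample until y is excluded
            lo = int(rng.integers(1, K + 1))
            hi = int(rng.integers(lo, K + 1))
        y_l[i], y_r[i] = lo, hi
    return X, y_l, y_r
```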

3.3. Kernel Functions

Following [15], we use a dataset-specific kernel function for Kernel PRIL on each of the Abalone, Parkinsons-updrs, and Real Estate Valuation datasets.

3.4. Result Comparison

We compare the noise-tolerance performance of the PRIL-RAMP and K-PRIL-RAMP algorithms with the basic PRIL and Kernel PRIL algorithms proposed in [15], respectively, where PRIL is a perceptron-based approach for online ordinal regression using interval labels. The difference between PRIL and Kernel PRIL is that PRIL is linear, while Kernel PRIL uses a nonlinear kernel function in the experiments.

3.4.1. Accuracy

We use average accuracy as the metric to evaluate the performance of all algorithms in the experimental studies. Accuracy, short for the average accuracy, is defined as $\mathrm{Acc} = \frac{1}{T} \sum_{t=1}^{T} \mathbb{I}[\hat{y}^t = y^t]$, i.e., one minus the mean zero-one error [22], where $y^t$ is the true label and $\hat{y}^t$ is the predicted label. We run all algorithms 10 times and average the instantaneous accuracy across the 10 runs. The accuracies of the four algorithms on data with different noise levels are reported in Figures 7–9. In the figures, the x-axis represents the number of examples, and the y-axis represents the accuracy on test examples. We select an appropriate ramp loss parameter $\epsilon$ for PRIL-RAMP and K-PRIL-RAMP.
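The accuracy metric itself is straightforward (our own helper):

```python
import numpy as np

def average_accuracy(y_true, y_pred):
    """Mean zero-one accuracy over a stream of predictions."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))
```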

The results of the experimental studies are summarized as follows:

(1) For all the datasets, accuracy gradually increases and stabilizes as the number of observed examples grows. The accuracy of all algorithms decreases as the level of noise increases; the interference of label noise in the datasets does degrade the prediction accuracy.

(2) The linear algorithms (PRIL and PRIL-RAMP) and the nonlinear kernel algorithms (Kernel PRIL and K-PRIL-RAMP) perform differently on different datasets, as determined by each dataset. Nevertheless, the kernel algorithms need to solve a more complex model for ranking problems.

(3) In general, PRIL-RAMP outperforms PRIL. In particular, K-PRIL-RAMP shows consistently higher test accuracy than Kernel PRIL over all datasets at all noise levels. The noise-resilient algorithms (PRIL-RAMP and K-PRIL-RAMP) and the noise-sensitive algorithms (PRIL and Kernel PRIL) work equally well in terms of accuracy when no noise is added. Moreover, the accuracy gap between the noise-tolerant (PRIL-RAMP and K-PRIL-RAMP) and noise-sensitive (PRIL and Kernel PRIL) algorithms widens with the level of noise.

(4) On the Abalone and Parkinsons-updrs datasets, the linear algorithms perform better than the kernel methods when noise is added. The prediction accuracies of PRIL and Kernel PRIL are relatively low and unstable on datasets with 25% and 50% noise, which shows that these two algorithms are very sensitive to noise. Interestingly, the noise-resilient PRIL-RAMP and K-PRIL-RAMP algorithms show better noise-tolerant performance. On the real estate valuation dataset, the accuracies of the four algorithms are close when the dataset has 0% noise. However, when noise is added, PRIL-RAMP and K-PRIL-RAMP outperform Kernel PRIL across all levels of noise. In other words, PRIL-RAMP and K-PRIL-RAMP are more robust to noise.

Overall, the noise-insensitive PRIL-RAMP and K-PRIL-RAMP algorithms significantly outperform basic PRIL and Kernel PRIL, which indicates that ramp loss-based algorithms can effectively handle noisy data.

The major difference between the linear algorithms (PRIL and PRIL-RAMP) and the nonlinear ones (Kernel PRIL and K-PRIL-RAMP) is that PRIL and PRIL-RAMP use a constant step size, whereas the nonlinear algorithms choose a more complex step size to find the new $\mathbf{w}$ (or kernel coefficients) and $\boldsymbol{\theta}$ in every trial. In PRIL-RAMP and K-PRIL-RAMP, the step size is determined by the OGD optimization of the ramp loss, a noise-resilient function that drives the contribution of the noise loss to zero when there is a lot of noise, whereas PRIL and Kernel PRIL use all samples to make the loss on the current example zero. Thus, the proposed PRIL-RAMP and K-PRIL-RAMP perform better than or comparably to the other methodologies and are noise resistant.

3.4.2. Noise Statistical Study

We investigate the proposed noise-resilient online ordinal regression algorithm in the case of noisy covariate data while taking the label order into account. Specifically, we apply statistical measures to answer the question of how effective the proposed PRIL-RAMP method is in handling data with noisy input, reporting a number of simulation studies on finite-sample performance. Considering order and statistical measures, we compare the MAE (mean absolute error) [22], RMSE (root mean squared error), Spearman's correlation coefficient, number of discarded samples, discard rate, and run time of PRIL, PRIL-RAMP, and PA-I [10] on data with different noise levels. Here $\mathrm{MAE} = \frac{1}{T} \sum_{t=1}^{T} |\hat{y}^t - y^t|$ and $\mathrm{RMSE} = \sqrt{\frac{1}{T} \sum_{t=1}^{T} (\hat{y}^t - y^t)^2}$, where $\hat{y}^t$ is the predicted rank and $y^t$ is the true one. Their values range from 0 to $K - 1$ (the maximum deviation in the number of categories); the larger the values, the worse the prediction accuracy. In ordinal matching, a useful and common measure is Spearman's rank correlation, which measures the correspondence in rank terms between two distributions; it captures structural similarity and ranges from -1 to 1, with higher values indicating closer agreement. A discarded sample is a sample that has no updating effect on the model, namely, one for which $\tau_j^t = 0$ for all $j$ in the algorithm iteration. We count the discarded samples by tracking the values of $\tau_j^t$. The discard rate is the percentage of discarded samples out of the total. For each dataset, we discuss four noise scenarios: 0%, 25%, 50%, and 75%.
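The three error measures can be computed as follows (our own sketch; SciPy's spearmanr is used for the rank correlation):

```python
import numpy as np
from scipy.stats import spearmanr

def ordinal_metrics(y_true, y_pred):
    """MAE, RMSE, and Spearman's rank correlation for predicted ranks."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mae = float(np.mean(np.abs(y_pred - y_true)))
    rmse = float(np.sqrt(np.mean((y_pred - y_true) ** 2)))
    rho = spearmanr(y_true, y_pred).correlation
    return mae, rmse, rho
```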

The results are summarized in Tables 2–5. As can be seen from the MAE and RMSE columns, the proposed PRIL-RAMP method outperforms the competing PRIL and PA-I in all four noise cases. Especially at high noise levels (50% and 75%), PRIL-RAMP significantly outperforms the baseline PRIL algorithm. Spearman's correlation coefficient shows that PRIL-RAMP achieves greater rank correspondence between the two distributions in three of the noise cases. Empirically, as the noise ratio increases, the differences between the methods on each evaluation index become more apparent. Due to the intrinsic flaw of the hinge loss, the PA-I method is highly sensitive to noise, whereas the ramp loss effectively reduces the impact of noisy data in most cases, with exceptions on individual datasets. Besides, the number of discarded samples increases with the level of noise for PRIL-RAMP, compared with PRIL and PA-I; most of the samples are discarded when the noise level is 75%. Not surprisingly, the percentage of noisy data significantly affects the discard rate and the results of the algorithms; empirically, the discard rate is approximately equal to the noise ratio. Moreover, estimation from small samples can make the model unstable because the prediction accuracy depends on the proportion of noise points used to train the model. Overall, PRIL-RAMP is a better noise-resilient method than PA-I for dealing with noisy data, though the latter is faster.

3.4.3. Noise Sensitivity Comparison

To compare the trends of MAE and discard rate, we plot the 2D performance variation under different noise settings in Figure 10. As can be observed, the proposed PRIL-RAMP significantly outperforms the two competing methods (PA-I and PRIL) on high-level noisy data. Because of the perceptron loss, the original PRIL is very sensitive to noisy data; moreover, the MAE of PA-I is higher than that of PRIL because the hinge loss is even more sensitive. The MAE gradually increases and then stabilizes as the noise ratio grows: when the noise ratio exceeds 50%, most of the noisy samples are discarded, so the MAE stabilizes even though it increases with the amount of noisy data. Furthermore, by discarding samples in the high-noise regime, PRIL-RAMP markedly reduces the impact of noisy data.

4. Conclusion and Future Works

In this paper, we studied online learning from noisy data streams. First, we proposed the ramp loss-based ordinal regression model PRIL-RAMP and its nonlinear variant K-PRIL-RAMP to incrementally predict interval labels in noisy environments in an online manner. Moreover, a theoretical analysis of threshold order preservation was carried out, which shows that the proposed ramp loss-based algorithms remain well behaved in the interval-label scenario while being noise resilient. Finally, experimental studies showed that PRIL-RAMP and K-PRIL-RAMP are robust and can be used to deal with noisy data streams. Future work will involve extending the ordinal regression model to the PA model by introducing a suitable convex loss function [12] and building on convex optimization theory [23].

Data Availability

The datasets used to support the findings of this study are available from http://archive.ics.uci.edu/ml/.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China under grant nos. 71961004, 71461005, and 71561008, the Thesis Cultivation Project of GUET Graduate Excellent Master under grant no. 2019YJSPY01, and the Innovation Project of GUET Graduate Education under grant no. 2019YCXS081.