Abstract
Extensions of kernel methods for class imbalance problems have been extensively studied. Although they cope well with nonlinear problems, their high computation and memory costs severely limit their application to real-world imbalanced tasks. The Nyström method is an effective technique for scaling kernel methods. However, the standard Nyström method needs to sample a sufficiently large number of landmark points to ensure an accurate approximation, which seriously affects its efficiency. In this study, we propose a multi-Nyström method based on mixtures of Nyström approximations to avoid the explosion of the subkernel matrix, while the optimization of the mixture weights is embedded into the model training process by multiple kernel learning (MKL) algorithms to yield a more accurate low-rank approximation. Moreover, we select subsets of landmark points according to the imbalance distribution to reduce the model’s sensitivity to skewness. We also provide a kernel stability analysis of our method and show that the model solution error is bounded by weighted approximation errors, which can help us improve the learning process. Extensive experiments on several large scale datasets show that our method can achieve a higher classification accuracy and a dramatic speedup of MKL algorithms.
1. Introduction
Real-world problems in computer vision [1], natural language processing [2, 3], and data mining [4, 5] present imbalanced traits in their data, which may arise from inherent properties of the data or from external factors such as sampling bias or measurement error. Unfortunately, most traditional learning algorithms are designed for balanced data and target the overall classification accuracy, so the minority class is overwhelmed by the majority class. However, the minority class in these real-world problems is usually more important and more expensive than the majority class.
In the past few decades, many algorithms have been proposed to solve class imbalance problems [6–8]. Data-level methods artificially balance the skewed class distributions by data sampling [9, 10]. Algorithm-level methods raise the importance of minority instances by modifying existing learners [11, 12]. However, these real-world imbalanced data usually contain complex nonlinear structures. In this case, extensions of kernel methods for class imbalance problems have proven very effective [13–15]. In [16], Mathew et al. overcome the limitations of the synthetic minority oversampling technique (SMOTE) for nonlinear problems by oversampling in the feature space of the support vector machine. In [17], a kernel boundary alignment algorithm is proposed to adjust the class boundary by modifying the kernel matrix according to the imbalanced data distribution. The kernel-based adaptive synthetic data generation (KernelADASYN) method for imbalanced learning is proposed in [18]; it uses kernel density estimation (KDE) to estimate the adaptive oversampling density. However, with the development of data storage and data acquisition equipment, the scale of data continues to grow. The existing kernel-based class imbalanced learning (kernel CIL) methods face a serious challenge: the cost of calculating and storing a vast kernel matrix is very high.
A general technique for making kernel methods scalable is kernel approximation, of which the Nyström method is the most popular [19]. The Nyström method constructs a low-rank approximation of the original kernel matrix from a subset of $m \ll n$ landmark points, where $n$ is the data size. Computationally, it only needs to decompose a smaller $m \times m$ matrix (denoted as $W$). However, according to the approximation error bound for the Nyström method in [20], there is a trade-off between accuracy and efficiency. Sampling more landmark points improves the approximation accuracy but requires more computing resources, so the subkernel matrix expands rapidly as the data size increases, which seriously affects the efficiency of the Nyström method.
Some works study the efficacy of a variety of fixed and adaptive sampling schemes for the Nyström method. For example, Musco et al. presented a new Nyström algorithm based on recursive leverage score sampling, which runs in time linear in the number of training points [21]. An ensemble Nyström method has been proposed to yield more accurate low-rank approximations by running mixtures of the Nyström method based on several randomly sampled subsets of landmark points [22]. However, the mixture weights of the ensemble Nyström method are defined according to the approximation error of each Nyström approximation, which may lead to worse-than-expected performance in practical classification or regression applications. Recently, a fast and accurate refined Nyström-based kernel classifier has been introduced to improve the performance of the Nyström-based kernel classifier [23]. Although the Nyström method has been studied extensively, there still exists a potentially large gap between the performance of a learner trained with the Nyström approximation and that of a learner trained with the original kernel.
In this study, we propose a novel method, multi-Nyström, for large scale imbalanced classification. We combine the multi-Nyström method with multiple kernel learning to learn an improved low-rank approximation kernel that is superior to any individual approximation, where each approximation is defined by a different kernel function and a different subset of landmark points. Moreover, unlike existing sampling schemes for the multi-Nyström method, our method selects subsets of landmark points according to the imbalance distribution to deal with the problem of skewed data. Because it never computes or stores the full kernel matrix, our method can scale to large scale scenarios. The main contributions of this study are summarized as follows:
(1) We propose a multi-Nyström method to overcome the computational constraints of the Nyström method. Because our method is easily parallelized, it can generate more accurate approximations in large scale scenarios.
(2) We optimize the mixture weights according to the data and the problem at hand, so that the combined approximation kernel matrix can produce better performance. Moreover, the low-rank approximation can significantly speed up existing MKL algorithms.
(3) We provide a stability analysis of our method, which shows the impact of the kernel approximation error on the model solution and helps determine the acceptable error in the approximation of the kernel matrix.
The rest of this study is organized as follows. Section 2 introduces some related concepts. Section 3 then describes the proposed multi-Nyström approximation algorithm in detail. Experimental results and analysis compared with other algorithms are presented in Section 4. Finally, Section 5 summarizes the full work.
2. Related Work
2.1. Kernel Methods
Kernel methods such as support vector machines (SVMs) have become one of the most popular technologies in machine learning [24]. They extend linear learners to nonlinear cases by introducing the kernel trick. Consider a binary-class dataset $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^{s}$ denotes an $s$-dimensional vector and $y_i \in \{-1, +1\}$ denotes its label. Define a nonlinear descriptor $\phi$ as
$$\phi : x \in \mathbb{R}^{s} \mapsto \phi(x) \in \mathcal{H}. \tag{1}$$
The input data are mapped to a high-dimensional or even infinite-dimensional feature space, and the inner product in the feature space is calculated implicitly through the kernel function defined in the input space:
$$k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle_{\mathcal{H}}, \tag{2}$$
where $k$ is the kernel function that satisfies Mercer’s theorem [25], and $\mathcal{H}$ is the corresponding reproducing kernel Hilbert space (RKHS). $k$ can simply be a classical kernel like the radial basis function (RBF) kernel. Unfortunately, the $n \times n$ kernel matrix expands quadratically with the increase of data scale. This poor scalability limits the applicability of kernel methods in large scale scenarios.
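To make the quadratic growth concrete, the following minimal sketch builds a Gaussian (RBF) kernel matrix with NumPy and estimates the memory a dense double-precision kernel matrix would need for a few data sizes. The names `rbf_kernel`, `dense_kernel_gigabytes`, the parameter `gamma`, and the toy data are illustrative assumptions rather than part of the original work.

```python
import numpy as np

def rbf_kernel(X, Z, gamma):
    """Gaussian kernel matrix with entries exp(-gamma * ||x - z||^2)."""
    sq_dist = (X**2).sum(axis=1)[:, None] + (Z**2).sum(axis=1)[None, :] - 2.0 * X @ Z.T
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))

def dense_kernel_gigabytes(n):
    """Memory (in GB) of a dense n x n float64 kernel matrix: 8 * n^2 bytes."""
    return 8 * n * n / 1e9

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))        # toy data: 200 instances, 5 features
K = rbf_kernel(X, X, gamma=0.5)      # full 200 x 200 kernel matrix

for n in (10_000, 100_000, 1_000_000):
    print(f"n = {n:>9,d}: {dense_kernel_gigabytes(n):>10,.1f} GB")
```

Multiplying the data size by ten multiplies the storage by one hundred, which is why the full kernel matrix quickly becomes impractical to hold in memory.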
2.2. Multiple Kernel Learning
Due to different kernels corresponding to different similarity concepts or using features from different views, MKL can obtain more complete representations of the input data by combining multiple kernels. In MKL, each instance is mapped into different feature spaces by a series of descriptors [26]:where represents feature from the m^{th} view of instance , , is the corresponding weight, and is the total number of predefined kernels. Then, substitute any dot product term with kernels:where each base kernel function is a positive definite kernel associated with an RKHS . The purpose of MKL is to learn a resulting discriminant function of the form with .
Based on the aforementioned definition, the seminal work in MKL proposes the following structural risk minimization framework as MKL primal problem with kernel weights on a simplex [27].where is the regularization parameter of the error term. is the slack variable. The L1norm constraint on the weight vector enforces the kernel combination to be sparse. We assume whenever in order to reach a finite objective. That implies if the weight of a certain kernel reaches , stop the optimization of since the solution is known [28].
Although MKL is an ideal candidate for combining multiview data, scalability is a key issue for MKL: (1) the computation and memory costs for maintaining several kernel matrices are heavy and (2) the computational efficiency of MKL solvers is not high.
2.3. Standard Nyström Method
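Let $Z = \{z_1, \dots, z_m\}$ denote a set of $m$ landmark points randomly selected from the $n$ training instances uniformly without replacement, let $C \in \mathbb{R}^{n \times m}$ denote the subkernel matrix between all instances and the landmark points, and let $W \in \mathbb{R}^{m \times m}$ be the symmetric positive semidefinite (SPSD) subkernel matrix among the points in $Z$. Then, the Nyström method uses $C$ and $W$ to generate a rank-$k$ approximation $\tilde{K}$ of the kernel matrix $K$ for $k \le m$ [20]:
$$\tilde{K} = C W_k^{+} C^{\top} \approx K, \tag{6}$$
where $W_k$ is the best rank-$k$ approximation to $W$ with respect to the Frobenius norm, that is, $W_k = \arg\min_{\operatorname{rank}(V) = k} \| W - V \|_F$, and $W_k^{+}$ denotes the pseudoinverse of $W_k$. Given the matrix $\tilde{K}$, the feature of each instance can be evaluated as
$$\tilde{\phi}(x) = \left(W_k^{+}\right)^{1/2} \left(k(x, z_1), \dots, k(x, z_m)\right)^{\top}. \tag{7}$$
Calculate the singular value decomposition (SVD) of $W$ as $W = U \Sigma U^{\top}$, where $U$ is orthonormal and $\Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_m)$ is diagonal with $\sigma_1 \ge \dots \ge \sigma_m \ge 0$. Then, the final approximate decomposition of $\tilde{K}$ is denoted in the following form:
$$\tilde{K} = G G^{\top}, \qquad G = C U_k \left(\Sigma_k^{+}\right)^{1/2}, \tag{8}$$
where $\Sigma_k$ is the diagonal matrix formed by the top $k$ singular values of $W$, and $U_k$ is formed by the associated singular vectors.
The total time complexity of the Nyström method is $O(m^3 + nmk)$, including $O(m^3)$ for the SVD on $W$ and $O(nmk)$ for the matrix multiplication with $C$ [29]. For $m \ll n$, it is much lower than the $O(n^3)$ complexity taken by the SVD on $K$.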
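The standard Nyström procedure described in this subsection can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: the helper names `rbf_kernel` and `nystrom_features`, the Gaussian kernel parameter `gamma`, and the toy data are hypothetical, and the top `k` singular values of the landmark matrix are assumed to be strictly positive.

```python
import numpy as np

def rbf_kernel(X, Z, gamma):
    """Gaussian kernel matrix with entries exp(-gamma * ||x - z||^2)."""
    sq_dist = (X**2).sum(axis=1)[:, None] + (Z**2).sum(axis=1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))

def nystrom_features(X, landmarks, k, gamma):
    """Return an n x k factor G such that G @ G.T is the rank-k Nystrom approximation."""
    C = rbf_kernel(X, landmarks, gamma)           # n x m: instances vs. landmarks
    W = rbf_kernel(landmarks, landmarks, gamma)   # m x m: landmarks vs. landmarks
    U, s, _ = np.linalg.svd(W)                    # W is SPSD, so W = U diag(s) U^T
    U_k, s_k = U[:, :k], s[:k]                    # top-k singular pairs
    return C @ U_k / np.sqrt(s_k)                 # scale column j by s_k[j] ** -0.5

rng = np.random.default_rng(0)
n, m, k = 300, 50, 20
X = rng.normal(size=(n, 5))
idx = rng.choice(n, size=m, replace=False)        # uniform sampling without replacement
G = nystrom_features(X, X[idx], k=k, gamma=0.5)

K = rbf_kernel(X, X, gamma=0.5)                   # full kernel, built only to measure error
rel_err = np.linalg.norm(K - G @ G.T) / np.linalg.norm(K)
print(f"relative Frobenius error with m={m}, k={k}: {rel_err:.3f}")
```

Only the small landmark matrix is decomposed; the full kernel matrix appears here solely to measure the approximation error and would not be formed in a large scale setting.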
3. Proposed Algorithms
3.1. Multi-Nyström Method
We divide the imbalance dataset into the minority class set and the majority class set . When there are irregularities in the imbalanced data (such as small disjuncts, overlapping, and noise [30]) and the data scale is large, applying a single kernel may make the model biased, skew, or misleading. Inspired by the MKL algorithm [31], we construct a low rank approximate multiple kernel framework as follows:where corresponds to the rank approximation of each base kernel matrix , and is the corresponding mixture weight. As for the Nyström method, a key aspect is the sampling scheme [32]. For reducing the sensitivity to skewness in data, we adopt the stratified undersampling of the majority class to select subsets of landmark points written as with each . The subkernel matrix between all instances and the landmark points can be expressed aswhere . Then, we perform the standard Nyström method on each independently to get a rank approximation of each base kernel matrix . Finally, by linearly combining these approximations, we can get the general form of approximation multiple kernel :
Given the mixture weight , the feature of each instance can be evaluated as
Similarly, for the convenience of subsequent calculations, formula (11) can be rewritten as where , and denotes the approximate decomposition of obtained by (8). Figure 1 shows the proposed multi-Nyström method, including the optimization of the mixture weights, which is detailed further in the next subsection.
When the mixture weight is fixed or known, the total time complexity of the multi-Nyström method is . Although our method requires times more CPU resources than the standard Nyström method, is typically O(1) for large scale data, and our method can be computed in parallel in a distributed computing environment. Moreover, the SVD on the subkernel matrix is decomposed into SVDs on much smaller matrices, which also accelerates the calculation.
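The construction in this subsection can be sketched as follows for fixed mixture weights. This is a rough illustration, not the authors' implementation: the helper names, the choice of drawing each landmark subset as half minority and half majority instances, the Gaussian bandwidths in `gammas`, and the values in `weights` are all assumptions made for the example. The weighted sum of the individual approximations is represented by stacking the per-kernel factors, each scaled by the square root of its weight.

```python
import numpy as np

def rbf_kernel(X, Z, gamma):
    sq_dist = (X**2).sum(axis=1)[:, None] + (Z**2).sum(axis=1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))

def nystrom_features(X, landmarks, k, gamma):
    C = rbf_kernel(X, landmarks, gamma)
    W = rbf_kernel(landmarks, landmarks, gamma)
    U, s, _ = np.linalg.svd(W)
    return C @ U[:, :k] / np.sqrt(s[:k])

def balanced_landmarks(y, m, rng):
    """Assumed scheme: up to m // 2 minority points, the rest undersampled majority points."""
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == -1)
    n_min = min(m // 2, minority.size)
    pick_min = rng.choice(minority, size=n_min, replace=False)
    pick_maj = rng.choice(majority, size=m - n_min, replace=False)
    return np.concatenate([pick_min, pick_maj])

def multi_nystrom_features(X, y, gammas, weights, m, k, rng):
    """Stack sqrt(weight) * G_p so that G @ G.T equals the weighted sum of approximations."""
    blocks = []
    for gamma, weight in zip(gammas, weights):
        idx = balanced_landmarks(y, m, rng)            # one landmark subset per base kernel
        G_p = nystrom_features(X, X[idx], k, gamma)    # independent standard Nystrom step
        blocks.append(np.sqrt(weight) * G_p)
    return np.hstack(blocks)

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 5))
y = np.where(rng.random(n) < 0.1, 1, -1)               # roughly 10% minority class
gammas = [0.1, 0.5, 1.0]
weights = np.array([0.2, 0.5, 0.3])                    # fixed mixture weights on the simplex
G = multi_nystrom_features(X, y, gammas, weights, m=40, k=15, rng=rng)
print(G.shape)
```

Each pass through the loop is independent of the others, which is what makes the per-kernel approximations straightforward to run in parallel.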
3.2. Optimization to Mixture Weights
The purpose of MKL is to learn an optimal convex combination of a series of kernels during training. Based on the aforementioned definition, we propose an approximate multiple kernel learning framework for large scale imbalanced classification by modifying the original MKL framework in [26]wherewhere is the Lagrange multipliers vector, and . To avoid numerical instability caused by illconditioning [19], we substitute , where is a small positive constant called jitter factor. Moreover, to calculate the inverse of the approximate matrix and avoid storing the complete matrix , we iteratively perform the following series of operations:where is calculated using the SMW formula according to the last result . After performing the series of operations, we can obtain .
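Lemma 1 (see [33]). Let $A \in \mathbb{R}^{n \times n}$ and $B \in \mathbb{R}^{r \times r}$ both be invertible, with $U \in \mathbb{R}^{n \times r}$ and $V \in \mathbb{R}^{r \times n}$; then, the Sherman–Morrison–Woodbury (SMW) formula gives an explicit formula for the inverse of matrices of the form $A + UBV$ if $B^{-1} + V A^{-1} U$ is invertible:
$$(A + UBV)^{-1} = A^{-1} - A^{-1} U \left(B^{-1} + V A^{-1} U\right)^{-1} V A^{-1}.$$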
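As a numerical illustration of the lemma, the sketch below applies the SMW identity to a jittered low-rank matrix of the form εI + GGᵀ, so that only a small r × r system has to be solved. It is a one-shot simplification rather than the iterative per-kernel update described above; the function name `smw_solve`, the factor `G`, and the jitter value `eps` are assumptions for the example.

```python
import numpy as np

def smw_solve(G, eps, b):
    """Solve (eps * I + G @ G.T) x = b without forming the n x n matrix.

    SMW with A = eps * I, U = G, B = I, V = G.T gives
    (eps*I + G G^T)^{-1} = (I - G (eps*I_r + G^T G)^{-1} G^T) / eps.
    """
    r = G.shape[1]
    small = eps * np.eye(r) + G.T @ G                 # r x r system only
    return (b - G @ np.linalg.solve(small, G.T @ b)) / eps

rng = np.random.default_rng(0)
n, r, eps = 200, 10, 1e-2
G = rng.normal(size=(n, r))
b = rng.normal(size=n)

x_smw = smw_solve(G, eps, b)
x_direct = np.linalg.solve(eps * np.eye(n) + G @ G.T, b)  # reference, O(n^3)
print(np.max(np.abs(x_smw - x_direct)))
```

The direct solve is included only as a reference; for large n the low-rank route avoids both the n × n storage and the cubic cost.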
We can find that when the mixture weight is known, formula (15) is same as the dual problem of SVM. Hence, we havewhere is the optimal solution minimizing (15). With considered a constant in , can be regarded as a function of , and we calculate the gradient of the objective with respect to .
We use the reduced gradient method in [27] to deal with problem (14). First, to satisfy the L1-norm constraint on the weight vector in (14), we calculate the reduced gradient of : where denotes the reduced gradient of . Let be the largest element of the vector , and be the corresponding index. Obviously, would be a descent direction. However, if that makes with , then , which does not meet the nonnegative restriction. Therefore, needs to be set to 0. The updated descent direction is as follows:
In general, MKL uses a two-step training method. It requires frequent calls to support vector machine solvers, which is prohibitive for large scale problems. Therefore, after each update on , we do not immediately substitute it into the support vector machine solver to update , but continue to look for the maximum allowable step length in this descent direction until the objective function value stops declining. Finally, we obtain the optimal step length by the line search method. The complete algorithm of the multi-Nyström method with MKL is summarized in Algorithm 1.
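A minimal sketch of this kind of update, following the reduced gradient construction used in SimpleMKL [27], is given below. The function names, the example weights, and the example gradient values are hypothetical. The direction keeps the weights summing to one and freezes any zero weight that the update would otherwise push below zero; `max_step` returns the largest step that keeps every weight nonnegative.

```python
import numpy as np

def descent_direction(d, grad):
    """Reduced-gradient descent direction on the simplex {d >= 0, sum(d) = 1}."""
    d = np.asarray(d, dtype=float)
    grad = np.asarray(grad, dtype=float)
    mu = int(np.argmax(d))                  # index of the largest weight
    reduced = grad - grad[mu]               # reduced gradient for components m != mu
    D = -reduced                            # move against the reduced gradient
    D[(d <= 0.0) & (reduced > 0.0)] = 0.0   # would make a zero weight negative: freeze it
    D[mu] = 0.0
    D[mu] = -D.sum()                        # compensating term keeps sum(d) constant
    return D

def max_step(d, D):
    """Largest step gamma such that d + gamma * D stays nonnegative."""
    shrinking = D < 0.0
    if not shrinking.any():
        return np.inf
    return float(np.min(-d[shrinking] / D[shrinking]))

d = np.array([0.5, 0.5, 0.0])               # current mixture weights
grad = np.array([-1.0, -3.0, 2.0])          # hypothetical gradient of the objective
D = descent_direction(d, grad)
step = max_step(d, D)
d_new = d + step * D
print(D, step, d_new)
```

In this toy case the third kernel keeps a zero weight, because increasing it would raise the objective and decreasing it would violate nonnegativity, while weight moves from the first kernel to the second.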

3.3. Kernel Stability Analysis
In previous related work, the Nyström method is usually treated as a preprocessing step, and most studies only examine approximation error bounds without considering the impact of the approximation on the performance of the kernel machine. In the following, we analyze the kernel stability of our method, bounding the relative performance in terms of the weighted kernel approximation error. This provides performance guarantees for our multi-Nyström approximation method in the context of large scale imbalanced classification.
Proposition 1. Let be the optimal solution for kernel SVM with kernel and be the solution of kernel SVM with kernel obtained by Nyström approximation. Then,where is the smallest eigenvalue of , and is the constant from Hoffman’s bound independent on and .
Proof. Define be the projected gradient, where is the bounded constraint and is the convex projection operator. It can be used to define an error bound according to the following theorem:
Theorem 1 (see [34]). Let be the nearest optimal solution of the convex optimization problem:with being strongly convex, being Lipschitz continuous, and is a polyhedral set. The optimization problem admits a global error bound:where is the constant from Hoffman’s bound.
Considering now the problem with and bounded constraint , then
Note that the above problem is equivalent to problem (15) with the equality ( is SPSD), and we have
Let be the dual objective function of multiple kernel learning problem (5) with the original kernel , and be the objective function of approximate multiple kernel learning problem (9) with kernel obtained by our multiNyström method (13). Consider now and as the optimal solutions of and , respectively. We havewhere we use the fact that ; therefore,where is the spectral norm error of the ^{th} Nyström approximate based on the ^{th} subset of landmark points.
Furthermore, we use the inequality of the kernel SVM given by [35] (proof of Theorem 2) along with Theorem 1 to upper bound the norm difference between the optimal solutions of and :
The proposition shows that the norm difference is controlled by a weighted Nyström approximation error. It therefore suggests focusing on accurately approximating the kernel matrices with greater weights in order to obtain better learning performance.
4. Experiments
In this section, to validate the efficiency of the proposed method in solving large scale imbalanced problems, we compare our method against kernel methods including SVM and MKSVM (multiple kernel SVM), as well as the Nyström approximation method. All experiments are run on a PC with an Intel quad-core i7-8565U CPU and 8 GB memory.
4.1. Implementation
We run our experiments on five real-world imbalanced datasets from the KEEL data repository (https://keel.es/) and the LIBSVM archive (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/) (Table 1). For a fair comparison, we perform stratified 5-fold cross-validation 10 times and report the average result. We use LIBSVM (https://www.csie.ntu.edu.tw/~cjlin/libsvm/index.html) and SimpleMKL (https://asi.insa-rouen.fr/enseignants/~arakoto/code/mklindex.html) to run kernel SVM and MKSVM, respectively. As the kernel type, all experiments use the Gaussian kernel with bandwidth in the range of . Because we are interested in relative performance, we empirically set the trade-off parameter C = 100. In this study, we adopt the following three evaluation measures of the classification performance on imbalanced datasets: F1 score, G-mean, and area under the ROC curve (AUC). The first two are computed as
$$F_1 = \frac{2\,TP}{2\,TP + FP + FN}, \qquad G\text{-mean} = \sqrt{\frac{TP}{TP + FN} \times \frac{TN}{TN + FP}},$$
where TP, TN, FP, and FN represent the number of true-positive, true-negative, false-positive, and false-negative instances, respectively. F1 score measures the classification performance on the minority class. G-mean reflects the overall classification performance. AUC works well for comparing performance between algorithms [36].
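For reference, F1 score and G-mean can be computed directly from the four confusion-matrix counts, as in the short sketch below. The function name and the example counts are illustrative assumptions, with the minority class treated as the positive class. The second example shows why overall accuracy is misleading on skewed data: a classifier that always predicts the majority class reaches 95% accuracy while both measures drop to zero.

```python
import math

def imbalance_metrics(tp, tn, fp, fn):
    """Return (F1 score, G-mean) from confusion-matrix counts; minority class is positive."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0          # true-positive rate
    specificity = tn / (tn + fp) if tn + fp else 0.0     # true-negative rate
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    gmean = math.sqrt(recall * specificity)
    return f1, gmean

# A reasonable classifier on 50 minority and 950 majority instances.
f1, gmean = imbalance_metrics(tp=40, tn=930, fp=20, fn=10)
print(f"F1 = {f1:.4f}, G-mean = {gmean:.4f}")

# A classifier that always predicts the majority class: accuracy is 950/1000.
f1_trivial, gmean_trivial = imbalance_metrics(tp=0, tn=950, fp=0, fn=50)
print(f"F1 = {f1_trivial:.4f}, G-mean = {gmean_trivial:.4f}")
```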
4.2. Experimental Results
Table 2 provides the average experimental results of the proposed method and the other three algorithms on the four imbalanced datasets using the above three measures. We first compare SVM and the standard Nyström method. The Nyström method uses uniform sampling without replacement to approximate the kernel matrix, which relieves the model’s sensitivity to class imbalance to a certain extent. For example, on the Poker89_vs_5 dataset, the Nyström method improves G-mean nearly sevenfold over SVM. However, we can also see that, in terms of AUC and F1 score, there still exists a large gap in model accuracy compared with SVM.
Next, we compare our multi-Nyström method with the standard Nyström method. The experimental results clearly demonstrate that our method outperforms the Nyström method, especially in the context of extreme imbalance. This is mainly due to the undersampling of the majority class, which can effectively balance the class distribution. Moreover, it can be seen that multi-Nyström can improve the accuracy of the model. For example, with the same number of landmark points, the F1 score and AUC value of multi-Nyström are closer to those of SVM on the USPS dataset, and even higher on the Poker89_vs_5 and Pageblocks0 datasets.
Note that our method is also a type of approximation of MKL, so we finally also examine the performance of the MKL-based MKSVM. From the results, we can see the effect of using MKL to represent input data, which also implicitly explains how our method achieves better accuracy at the expense of more computation.
4.3. Discussion
In this part, we further discuss the impact of different parameters on performance. In the first experiment, to study the impact of the number of sampled landmark points on the classification performance, we fix the approximate rank parameter and successively increase the number of sampled landmark points, and then train and test the SVM model on four datasets, with results shown in Figure 2. We can see that, as the number of sampled landmark points increases, the performance of our method and of Nyström still shows a rising trend despite some fluctuations. Moreover, except in a few cases, our method uses fewer landmark points and still yields a higher G-mean.
In the second experiment, we study how the performance varies with the rank parameter. Figure 3 shows the G-mean on four datasets as the approximate rank varies. It shows that, with the same approximate kernel rank, our method achieves better classification performance than the others.
Finally, we compare the running time of our method and MKSVM. We report the results on two datasets, USPS and Pageblocks, in Figure 4. The results show that our method can significantly speed up the MKL process while maintaining performance. For example, on the USPS dataset, our method reduces the running time by more than one order of magnitude. The main reason is that the low-rank structure of the approximate kernel matrix speeds up the MKL algorithm.
For further analysis of the experimental results, we perform the Friedman test with respect to the F1 score. First, we calculate the average ranks of SVM, Nyström, multi-Nyström, and MKSVM as shown in Figure 5. It can be noticed that MKSVM gives the best performance. Meanwhile, SVM and the proposed multi-Nyström rank similarly. In a comparison of $k$ algorithms on $N$ datasets, considering $R_j$ as the average ranking of the $j$-th algorithm, the Friedman variable can be calculated as follows:
$$F_F = \frac{(N - 1)\,\chi_F^2}{N(k - 1) - \chi_F^2},$$
with
$$\chi_F^2 = \frac{12N}{k(k + 1)} \left[ \sum_{j=1}^{k} R_j^2 - \frac{k(k + 1)^2}{4} \right],$$
where $F_F$ is distributed according to the $F$ distribution with $k - 1$ and $(k - 1)(N - 1)$ degrees of freedom. For our experiments, . The critical value of $F(3, 9)$ is 3.8625 for $\alpha = 0.05$. Since $F_F$ exceeds this critical value, we can reject the null hypothesis that all the algorithms have the same performance. Then, we perform the Nemenyi test to compare algorithms pairwise. The critical difference is calculated as follows:
$$CD = q_\alpha \sqrt{\frac{k(k + 1)}{6N}},$$
considering and . The differences between the average rankings of SVM, Nyström, and multi-Nyström and that of MKSVM are 1.0, 2.75, and 1.25, respectively. Hence, we can state that the best method, MKSVM, is significantly better than Nyström at . However, the difference between MKSVM and the proposed multi-Nyström is not significant, which indicates that the proposed method achieves better performance than the standard Nyström kernel classifier and higher efficiency than MKSVM.
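The statistics above can be recomputed with a few lines of Python. The sketch rests on several assumptions that should be read as such: the average ranks are not read from Figure 5 but inferred from the reported rank differences (1.0, 2.75, and 1.25 relative to MKSVM) together with the fact that the average ranks of four algorithms must sum to 10; the numbers of algorithms and datasets are both taken as 4, consistent with the degrees of freedom implied by the quoted critical value; and the Nemenyi constants `q_alpha` are the commonly tabulated values for four algorithms, which may differ from the ones used in the original analysis.

```python
import math

def friedman_statistics(avg_ranks, n_datasets):
    """Return (chi-square Friedman statistic, F-distributed Friedman variable)."""
    k = len(avg_ranks)
    sum_sq = sum(r * r for r in avg_ranks)
    chi2 = 12 * n_datasets / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4)
    f_f = (n_datasets - 1) * chi2 / (n_datasets * (k - 1) - chi2)
    return chi2, f_f

def nemenyi_cd(q_alpha, k, n_datasets):
    """Critical difference of the Nemenyi post-hoc test."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n_datasets))

# Inferred average ranks (assumption): MKSVM = 1.25, then add the reported differences.
ranks = {"SVM": 2.25, "Nystrom": 4.0, "multi-Nystrom": 2.5, "MKSVM": 1.25}
k, n_datasets = len(ranks), 4

chi2, f_f = friedman_statistics(list(ranks.values()), n_datasets)
cd_005 = nemenyi_cd(2.569, k, n_datasets)   # assumed tabulated q for alpha = 0.05, k = 4
cd_010 = nemenyi_cd(2.291, k, n_datasets)   # assumed tabulated q for alpha = 0.10, k = 4

print(f"chi2_F = {chi2:.3f}, F_F = {f_f:.3f}")
print(f"CD(0.05) = {cd_005:.3f}, CD(0.10) = {cd_010:.3f}")
```

Under these assumptions the Friedman variable exceeds the critical value of 3.8625, and only the rank difference of 2.75 between Nyström and MKSVM is larger than the critical difference, which matches the conclusions stated in the text.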
5. Conclusions
In this study, we propose a novel method to overcome the time and memory limitations of the standard Nyström method and extend it to the case of large scale imbalanced classification. In general, kernel approximation and model training are carried out separately. To obtain more accurate results, our method mixes multiple Nyström approximations and embeds them in the model training process to learn the model parameters and mixture weights simultaneously. In particular, the approximate kernel matrix yielded by our method is low rank and balanced. We also provide an error bound of the model solution based on our approximation method to guide us in improving the learning process. Experimental results show that our method can achieve a higher classification accuracy. In addition, it can dramatically improve the efficiency of existing MKL algorithms.
Potential improvements: there are still some caveats in our current solution. For example, due to the curse of kernelization, the number of support vectors grows without bound as instances incur nonzero loss. This significantly increases the computational cost and can be infeasible for large scale problems. Future work will chiefly focus on more efficient variants of multi-Nyström involving budget kernel learning to address the issue.
Data Availability
The data used to support the findings of this study have been deposited in the KEEL repository (http://keel.es/) and the LIBSVM archive (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/).
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was jointly supported by the National Natural Science Foundation of China (61403397) and Natural Science Basic Research Plan in Shaanxi Province of China (2020JM358 and 2015JM6313).