Kernel Entropy Component Analysis with Nongreedy L1-Norm Maximization
Kernel entropy component analysis (KECA) is a recently proposed dimensionality reduction (DR) method that has shown superiority in many pattern analysis problems previously solved by principal component analysis (PCA). The optimized KECA (OKECA) is a state-of-the-art variant of KECA that returns projections retaining more expressive power than KECA. However, OKECA is sensitive to outliers and has been criticized for its high computational complexity, both of which stem from the inherent properties of the L2-norm. To handle these two problems, we develop a new extension to KECA, namely, KECA-L1, for DR or feature extraction. KECA-L1 aims to find a more robust kernel decomposition matrix such that the extracted features retain as much information potential as possible, where the information potential is measured by the L1-norm. Accordingly, we design a nongreedy iterative algorithm that converges much faster than OKECA. Moreover, a general semisupervised classifier is developed for KECA-based methods and applied to data classification. Extensive experiments on data classification and software defect prediction demonstrate that the new method is superior to most existing KECA- and PCA-based approaches. The code has also been made publicly available.
The curse of dimensionality is one of the major issues in machine learning and pattern recognition. It has motivated many scholars from different areas to implement dimensionality reduction (DR) to simplify the input space without degrading the performance of learning algorithms. Various efficient DR methods have been developed, such as independent component analysis (ICA), linear discriminant analysis, principal component analysis (PCA), and projection pursuit, to name a few. Among these algorithms, PCA has been one of the most widely used techniques for feature extraction (or DR). PCA implements a linear data transformation according to a projection matrix, which aims to maximize the second-order statistics of the input dataset. To extend PCA to nonlinear spaces, Schölkopf et al. proposed kernel PCA (KPCA). The key to KPCA is to find the nonlinear relation between the input data and the kernel feature space (KFS) using the kernel matrix, which is derived from a positive semidefinite kernel function that computes inner products. Both PCA and KPCA perform data transformation by selecting the eigenvectors corresponding to the top eigenvalues of the projection matrix and the kernel matrix, respectively. Both methods (including their variants) have achieved great success in different areas [8–12], such as image reconstruction, face recognition [14–17], and image processing [18, 19]. However, as suggested by Zhang and Hancock, DR should be performed from the perspective of information theory to obtain more acceptable results.
To improve the performance of the aforementioned approaches to DR, Jenssen developed a new and completely different data transformation algorithm, namely, kernel entropy component analysis (KECA). The main difference between KECA and PCA or KPCA is that the optimal eigenvectors (also called entropic components) derived from KECA preserve the most Renyi entropy of the input data instead of being associated with the top eigenvalues. The procedure of selecting the eigenvectors related to the Renyi entropy of the input space starts with a Parzen window kernel-based estimator. Then, only the eigenvectors contributing the most to the entropy of the input dataset are selected to perform DR. This distinguishing characteristic helps KECA achieve better performance than the classical PCA and KPCA in face recognition and clustering. More recently, Izquierdo-Verdiguier et al. employed the rotation matrix from ICA to optimize KECA and proposed the optimized KECA (OKECA). OKECA not only shows superiority in the classification of both synthetic and real datasets but can also obtain an acceptable kernel density estimation (KDE) using very few entropic components (just one or two) compared with KECA. However, OKECA is sensitive to outliers because of the inherent properties of the L2-norm. In other words, if the input data follow a normal distribution but are contaminated by non-normally distributed outliers, the DR performance of OKECA may degrade. Additionally, OKECA is very time-consuming when handling large-scale input datasets (Section 4).
Therefore, the main purpose of this paper is to propose a new variant of KECA that reduces the sensitivity of OKECA to outliers and improves its efficiency. The L1-norm is well known for its robustness to outliers. Additionally, Nie et al. established a fast iteration process to handle the general L1-norm maximization problem with a nongreedy algorithm. Hence, we take advantage of OKECA and propose a new L1-norm version of KECA (denoted as KECA-L1). KECA-L1 uses an efficient convergence procedure, motivated by Nie et al.’s method, to search for the entropic components contributing the most to the Renyi entropy of the input data. To evaluate the efficiency and effectiveness of KECA-L1, we design and conduct a series of experiments in which the data vary from single class to multiattribute and from small to large size. The classical KECA and OKECA are also included for comparison.
The remainder of this paper is organized as follows: Section 2 reviews the general L1-norm maximization issue, KECA, and OKECA. Section 3 presents KECA with nongreedy L1-norm maximization and semisupervised-learning-based classifier. Section 4 validates the performance of the new method on different data sets. Section 5 ends this paper with some conclusions.
2.1. An Efficient Algorithm for Solving the General L1-Norm Maximization Problem
The general L1-norm maximization problem was first raised by Nie et al. Under the hypothesis that the objective function has an upper bound, the problem can be generally formulated as
$$\max_{v \in \mathcal{C}} \; f(v) + \sum_{i} \lvert g_i(v) \rvert, \tag{1}$$
where both $f(v)$ and $g_i(v)$ for each $i$ denote arbitrary functions, and $v \in \mathcal{C}$ represents an arbitrary constraint.
Then a sign function is defined as
$$\operatorname{sgn}(x) = \begin{cases} 1, & x > 0, \\ 0, & x = 0, \\ -1, & x < 0, \end{cases} \tag{2}$$
and employed to transform the maximization problem (1) as follows:
$$\max_{v \in \mathcal{C}} \; f(v) + \sum_{i} \alpha_i\, g_i(v), \tag{3}$$
where $\alpha_i = \operatorname{sgn}(g_i(v))$. Nie et al. proposed a fast iteration process to solve problem (3), which is shown in Algorithm 1. It can be seen from Algorithm 1 that $\alpha_i$ is determined by the current solution $v^{t}$, and the next solution $v^{t+1}$ is updated according to the current $\alpha_i$. The iterative process is repeated until the procedure converges [23, 24]. The convergence of Algorithm 1 has been demonstrated, and the associated details can be found in the same references.
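The iteration described above can be illustrated on a small special case. The following is a minimal sketch, not the authors' Algorithm 1: it assumes $f(v) = 0$, $g_i(w) = w^T x_i$, and the constraint $\|w\|_2 = 1$, for which the update step has the closed form $w \leftarrow m / \|m\|_2$ with $m = \sum_i \alpha_i x_i$. The function names and the toy data are hypothetical.

```python
import math

def sgn(x):
    """Sign function: 1 for x > 0, 0 for x == 0, -1 for x < 0."""
    return (x > 0) - (x < 0)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def l1_objective(w, xs):
    """Sum of absolute projections, i.e. sum_i |w^T x_i|."""
    return sum(abs(dot(w, x)) for x in xs)

def nongreedy_l1_direction(xs, w0, max_iter=100, tol=1e-12):
    """Alternate between fixing the signs alpha_i and re-solving for w.

    Assumed special case: maximize sum_i |w^T x_i| subject to ||w||_2 = 1.
    """
    norm = math.sqrt(dot(w0, w0))
    w = [v / norm for v in w0]
    history = [l1_objective(w, xs)]
    for _ in range(max_iter):
        # Step 1: signs are determined by the current solution.
        alphas = [sgn(dot(w, x)) for x in xs]
        # Step 2: with the signs fixed, the maximizer of w^T m on the unit
        # sphere is m / ||m||, where m = sum_i alpha_i x_i.
        m = [sum(a * x[j] for a, x in zip(alphas, xs)) for j in range(len(w))]
        norm = math.sqrt(dot(m, m))
        if norm == 0.0:
            break
        w = [v / norm for v in m]
        history.append(l1_objective(w, xs))
        if history[-1] - history[-2] <= tol:
            break
    return w, history

# Toy data (hypothetical): three points near the first axis and one far point.
xs_demo = [[2.0, 0.1], [1.5, -0.2], [-1.8, 0.3], [0.2, 8.0]]
w_demo, history_demo = nongreedy_l1_direction(xs_demo, [1.0, 1.0])
```

The recorded objective values in `history_demo` never decrease from one iteration to the next, which is the monotonicity property that the convergence argument relies on.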
2.2. Kernel Entropy Component Analysis
KECA is characterized by its entropic components instead of the principal or variance-based components in PCA or KPCA, respectively. Hence, we first describe the concept of the Renyi quadratic entropy. Given the input dataset $X = [x_1, \ldots, x_N]$, the Renyi entropy of $X$ is defined as
$$H(p) = -\log \int p^2(x)\, dx, \tag{4}$$
where $p(x)$ is a probability density function. Because the logarithm is monotonic, Equation (4) can be analyzed through the quantity
$$V(p) = \int p^2(x)\, dx. \tag{5}$$
We can estimate Equation (5) using the kernel $k_\sigma(\cdot, \cdot)$ of the Parzen window density estimator determined by the bandwidth coefficient $\sigma$ such that
$$\hat{V}(p) = \frac{1}{N^2}\, \mathbf{1}^T K\, \mathbf{1}, \tag{6}$$
where $K_{ij} = k_\sigma(x_i, x_j)$ constitutes the $N \times N$ kernel matrix $K$ and $\mathbf{1}$ represents an $N$-dimensional vector containing all ones. With the help of the kernel decomposition,
$$K = E D E^T, \tag{7}$$
Equation (6) is transformed as follows:
$$\hat{V}(p) = \frac{1}{N^2} \sum_{i=1}^{N} \left( \sqrt{\lambda_i}\, e_i^T \mathbf{1} \right)^2, \tag{8}$$
where the diagonal matrix $D$ and the matrix $E$ consist of the eigenvalues $\lambda_1, \ldots, \lambda_N$ and the corresponding eigenvectors $e_1, \ldots, e_N$, respectively. It can be observed from Equation (8) that the entropy estimator consists of projections onto all the KFS axes because
$$K_{ij} = \phi(x_i)^T \phi(x_j), \tag{9}$$
where the function $\phi(\cdot)$ maps the two samples $x_i$ and $x_j$ into the KFS. Additionally, only an entropic component meeting the criteria $\lambda_i \neq 0$ and $e_i^T \mathbf{1} \neq 0$ can contribute to the entropy estimate $\hat{V}(p)$. In short, KECA implements DR by projecting onto a subspace spanned not by the eigenvectors associated with the top eigenvalues but by the entropic components contributing most to the Renyi entropy estimator $\hat{V}(p)$.
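The selection rule described above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation: it assumes a Gaussian kernel without its normalization constant, and the function names `rbf_kernel_matrix` and `keca` as well as the random demo data are hypothetical. Each eigenpair is scored by its entropy term $\lambda_i (e_i^T \mathbf{1})^2 / N^2$, and the highest-scoring axes are kept.

```python
import numpy as np

def rbf_kernel_matrix(X, sigma):
    """Gaussian kernel matrix for the rows of X (normalization constant omitted)."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative values
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def keca(K, n_components):
    """Keep the eigen-axes of K that contribute most to the entropy estimate."""
    N = K.shape[0]
    eigvals, eigvecs = np.linalg.eigh(K)      # K is symmetric
    eigvals = np.clip(eigvals, 0.0, None)     # remove round-off negatives
    # Entropy term of each axis: lambda_i * (e_i^T 1)^2 / N^2.
    contrib = eigvals * (eigvecs.T @ np.ones(N)) ** 2 / N ** 2
    order = np.argsort(contrib)[::-1][:n_components]
    # Row i holds the projection of sample i onto the selected axes.
    projections = eigvecs[:, order] * np.sqrt(eigvals[order])
    return projections, contrib, order

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 4))
K_demo = rbf_kernel_matrix(X_demo, sigma=1.5)
Z_demo, contrib_demo, order_demo = keca(K_demo, n_components=2)
```

Summing `contrib_demo` over all axes recovers $\mathbf{1}^T K \mathbf{1} / N^2$, which is a convenient sanity check that the per-axis terms decompose the estimator.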
2.3. Optimized Kernel Entropy Component Analysis
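Because KECA is sensitive to the bandwidth coefficient $\sigma$, OKECA was proposed to fill this gap and improve the performance of KECA on DR. Motivated by the fast ICA method, an extra rotation matrix $W$ (satisfying $W W^T = I$) is applied to the kernel decomposition (Equation (7)) in KECA to maximize the information potential (the entropy values in Equation (8)):
$$\max_{w_k} \; J(w_k) = \left\| \mathbf{1}^T E D^{1/2} w_k \right\|_2^2 \quad \text{s.t.} \quad W W^T = I, \tag{10}$$
where $\|\cdot\|_2$ is the L2-norm and $w_k$ denotes a column vector ($N \times 1$) in $W$. Izquierdo-Verdiguier et al. utilized a gradient-ascent approach to handle the maximization problem (10):
$$w_k(t+1) = w_k(t) + \tau\, \frac{\partial J}{\partial w_k(t)}, \tag{11}$$
where $\tau$ is the step size. $\partial J / \partial w_k$ can be obtained by Lagrangian multiplier:
$$\frac{\partial J}{\partial w_k} = 2 \left( \mathbf{1}^T E D^{1/2} w_k \right) D^{1/2} E^T \mathbf{1}. \tag{12}$$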
The entropic components multiplied by the rotation matrix can capture at least as much information potential as those of KECA, even when fewer components are used. Moreover, OKECA is robust to the choice of the bandwidth coefficient. However, OKECA has two main limitations. First, the new entropic components derived from OKECA are sensitive to outliers because of the inherent properties of the L2-norm (Equation (10)). Second, although a very simple stopping criterion is designed to avoid additional iterations, OKECA still has high computational complexity: on top of the $O(N^3)$ eigendecomposition that KECA also requires, its cost grows with the number of iterations $t$ needed to find the optimal rotation matrix.
3. KECA with Nongreedy L1-Norm Maximization
To alleviate the problems of OKECA, this section presents how to extend KECA to its nongreedy L1-norm version. For ease of understanding, the definition of the L1-norm is first introduced as follows:
Definition 1. Given an arbitrary vector $x \in \mathbb{R}^{N \times 1}$, the L1-norm of the vector $x$ is
$$\|x\|_1 = \sum_{j=1}^{N} |x_j|, \tag{13}$$
where $\|\cdot\|_1$ is the L1-norm and $x_j$ denotes the jth element of $x$.
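Then, motivated by OKECA, we develop a new objective function to maximize the information potential (Equations (8) and (10)) based on the L1-norm:
$$\max_{W} \; \sum_{i=1}^{N} \left\| W^T a_i \right\|_1 \quad \text{s.t.} \quad W^T W = I, \tag{14}$$
where $[a_1, \ldots, a_N] = A = D^{1/2} E^T$ and $N$ is the size of samples. The rotation matrix is denoted as $W \in \mathbb{R}^{d \times m}$, where $d$ and $m$ are the dimension of input data and the dimension of the selected entropic components (or number of projections), respectively. It is difficult to solve problem (14) directly, but we may regard it as a special case of problem (1) when $f(v) = 0$. Therefore, Algorithm 1 can be employed to solve (14). Next, we show in detail how to find the optimal solution of problem (14) based on the proposal from References [23, 24]. Let
$$M = \sum_{i=1}^{N} a_i \alpha_i^T, \qquad \alpha_i = \operatorname{sgn}\!\left( W^T a_i \right). \tag{15}$$
Thus, problem (14) can be simplified as
$$\max_{W^T W = I} \; \operatorname{Tr}\!\left( W^T M \right). \tag{16}$$
By singular value decomposition (SVD),
$$M = U \Lambda V^T, \tag{17}$$
where $U \in \mathbb{R}^{d \times d}$, $\Lambda \in \mathbb{R}^{d \times m}$, and $V \in \mathbb{R}^{m \times m}$. Then we obtain
$$\operatorname{Tr}\!\left( W^T M \right) = \operatorname{Tr}\!\left( W^T U \Lambda V^T \right) = \operatorname{Tr}\!\left( \Lambda V^T W^T U \right) = \operatorname{Tr}(\Lambda Z) = \sum_{i} \lambda_{ii} z_{ii}, \tag{18}$$
where $Z = V^T W^T U$, and $\lambda_{ii}$ and $z_{ii}$ denote the $(i, i)$ element of matrix $\Lambda$ and $Z$, respectively. Due to the property of SVD, we have $\lambda_{ii} \geq 0$. Additionally, $Z$ is an orthonormal matrix such that $z_{ii} \leq 1$. Therefore, $\operatorname{Tr}(W^T M)$ can reach the maximum only if $Z = [I_m, \mathbf{0}_{m \times (d-m)}]$, where $I_m$ denotes the $m \times m$ identity matrix, and $\mathbf{0}_{m \times (d-m)}$ is an $m \times (d-m)$ matrix of zeros. Considering that $Z = V^T W^T U$, the solution to problem (16) is
$$W = U Z^T V^T = U \left[ I_m ; \mathbf{0}_{(d-m) \times m} \right] V^T. \tag{19}$$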
Algorithm 2 (a MATLAB implementation is available in the Supplementary Materials for interested readers) shows how to utilize the nongreedy L1-norm maximization described in Algorithm 1 to compute Equation (19). Since problem (16) is a special case of problem (1), the solution given by Equation (19) is a local maximum point of the objective, based on Theorem 2 in the work of Nie et al. Moreover, Phase 1 of Algorithm 2 spends $O(N^3)$ on the eigendecomposition. Thus, the total computational cost of KECA-L1 consists of this term plus a term that grows with the number of iterations $t$ required for convergence. Comparing this with the computational complexity of OKECA (Section 2.3), we conclude that KECA-L1 converges much faster than OKECA.
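The two phases described above can be sketched as follows. This is a hedged illustration and not the authors' MATLAB toolbox: it assumes that the samples $a_i$ are the columns of $A = D^{1/2} E^T$, that the rotation is updated as $W = U V^T$ from the thin SVD of $M = \sum_i a_i\, \operatorname{sgn}(W^T a_i)^T$ (which equals $U [I; 0] V^T$ for the full SVD), and that the iteration starts from a random orthonormal matrix. The function name `keca_l1`, the stopping rule, and the demo data are hypothetical.

```python
import numpy as np

def keca_l1(K, n_components, max_iter=200, tol=1e-10, seed=0):
    """Sketch of a nongreedy L1-norm rotation on top of the kernel decomposition."""
    # Phase 1: eigendecomposition K = E D E^T and feature matrix A = D^{1/2} E^T.
    eigvals, E = np.linalg.eigh(K)
    eigvals = np.clip(eigvals, 0.0, None)
    A = np.sqrt(eigvals)[:, None] * E.T          # column i is the sample a_i
    d = A.shape[0]

    # Phase 2: nongreedy iteration from a random orthonormal start (assumption).
    rng = np.random.default_rng(seed)
    W, _ = np.linalg.qr(rng.normal(size=(d, n_components)))
    history = [np.abs(W.T @ A).sum()]            # sum_i ||W^T a_i||_1
    for _ in range(max_iter):
        alpha = np.sign(W.T @ A)                 # signs fixed by the current W
        M = A @ alpha.T                          # M = sum_i a_i alpha_i^T
        U, _, Vt = np.linalg.svd(M, full_matrices=False)
        W = U @ Vt                               # maximizer of Tr(W^T M), W^T W = I
        history.append(np.abs(W.T @ A).sum())
        if history[-1] - history[-2] <= tol * max(1.0, history[-2]):
            break
    return W, A, history

# Small demo with a Gaussian kernel on random data (hypothetical values).
rng_demo = np.random.default_rng(1)
X_l1 = rng_demo.normal(size=(20, 3))
sq = np.sum((X_l1[:, None, :] - X_l1[None, :, :]) ** 2, axis=2)
K_l1 = np.exp(-sq / 2.0)
W_demo, A_demo, hist_demo = keca_l1(K_l1, n_components=2)
features_demo = W_demo.T @ A_demo                # 2 x 20 low-dimensional features
```

Each pass costs one thin SVD of a $d \times m$ matrix plus two matrix products, so the per-iteration work is small compared with the eigendecomposition in Phase 1 when $m$ is much smaller than $N$.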
3.2. The Convergence Analysis
This subsection demonstrates the convergence of Algorithm 2 in the following theorem:
Theorem 1. The above KECA-L1 procedure can converge.
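Proof. Motivated by References [23, 24], we first show that the objective function (14) of KECA-L1 monotonically increases in each iteration $t$. Let $g_i^{t} = (W^{t})^T a_i$ and $\alpha_i^{t} = \operatorname{sgn}(g_i^{t})$; then (14) can be simplified to
$$\sum_{i=1}^{N} \left\| g_i^{t} \right\|_1. \tag{20}$$
Obviously, $\alpha_i^{t}$ is parallel to $g_i^{t}$, but it is not necessarily parallel to $g_i^{t+1}$. Therefore,
$$\left\| g_i^{t+1} \right\|_1 = (\alpha_i^{t+1})^T g_i^{t+1} \geq (\alpha_i^{t})^T g_i^{t+1}. \tag{21}$$
Considering that $|x| = x \operatorname{sgn}(x)$, thus
$$\left\| g_i^{t} \right\|_1 = (\alpha_i^{t})^T g_i^{t}. \tag{22}$$
Substituting (22) in (21), it can be obtained that
$$\left\| g_i^{t+1} \right\|_1 - \left\| g_i^{t} \right\|_1 \geq (\alpha_i^{t})^T g_i^{t+1} - (\alpha_i^{t})^T g_i^{t}. \tag{23}$$
According to Step 3 in Algorithm 2 and the theory of SVD, for each iteration $t$, we have
$$\sum_{i=1}^{N} (\alpha_i^{t})^T g_i^{t+1} - \sum_{i=1}^{N} (\alpha_i^{t})^T g_i^{t} \geq 0. \tag{24}$$
Combining (23) and (24) for every $i$, we have
$$\sum_{i=1}^{N} \left\| g_i^{t+1} \right\|_1 \geq \sum_{i=1}^{N} \left\| g_i^{t} \right\|_1, \tag{25}$$
which means that the objective value of Algorithm 2 is monotonically increasing. Additionally, considering that the objective function (14) of KECA-L1 has an upper bound, the KECA-L1 procedure will converge.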
3.3. The Semisupervised Classifier
Jenssen established a semisupervised learning (SSL) algorithm for classification using KECA. This SSL-based classifier is trained on both labeled and unlabeled data to build the kernel matrix so that it can map the data to the KFS appropriately. Additionally, it is based on a general modelling scheme and is applicable to other variants of KECA, such as OKECA and KECA-L1.
More specifically, we are given pairs of training data with samples and the associated labels . In addition, there are unlabeled data points for testing. Let and denote the testing data and training data without labels, respectively; thus, we can obtain an overall matrix . Then we construct the kernel matrix derived from using (6), , which plays as the input of Algorithm 2. After the iteration procedure of nongreedy L1-norm maximization, we obtain a projection of onto orthogonal axes, where and . In other words, and are the low-dimensional representations of each testing data point and the training one , respectively. Assume that is an arbitrary data point to be tested. If it satisfiesthen is assigned to the same class with the jth data point of .
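The final assignment step described above can be sketched as a nearest-neighbour lookup in the projected space. The sketch below is illustrative only: it assumes that closeness is measured by Euclidean distance between low-dimensional representations, which may differ from the criterion used by the authors, and the function name `assign_labels` and the toy coordinates are hypothetical.

```python
import math

def assign_labels(train_proj, train_labels, test_proj):
    """Give each test point the label of its closest projected training point.

    Assumption: closeness is the Euclidean distance in the projected space.
    """
    predicted = []
    for z in test_proj:
        dists = [math.dist(z, t) for t in train_proj]
        j = dists.index(min(dists))      # index of the nearest training point
        predicted.append(train_labels[j])
    return predicted

# Toy projections onto two axes (hypothetical values).
train_proj_demo = [[0.0, 0.1], [0.2, 0.0], [1.0, 1.1], [0.9, 1.0]]
train_labels_demo = [0, 0, 1, 1]
test_proj_demo = [[0.1, 0.0], [1.05, 0.95]]
predicted_demo = assign_labels(train_proj_demo, train_labels_demo, test_proj_demo)
```

Because the kernel matrix is built from the labeled and unlabeled points together, both sets of projections come from the same decomposition, so no separate out-of-sample mapping is needed before this lookup.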
This section shows the performance of the proposed KECA-L1 compared with the classical KECA and OKECA for real-world data classification using the SSL-based classifier illustrated in Section 3.3. Several recent techniques such as PCA-L1 and KPCA-L1 are also included for comparison. These methods were selected because previous studies on DR found that they can produce impressive results [27–29]. We run the experiments on a wide range of real-world datasets: (1) six different datasets from the University of California, Irvine (UCI) Machine Learning Repository (available at http://archive.ics.uci.edu/ml/datasets.html) and (2) nine different software projects with 34 releases from the PROMISE data repository (available at http://openscience.us/repo). The MATLAB source code for running KECA and OKECA, uploaded by Izquierdo-Verdiguier et al., is available at http://isp.uv.es/soft_feature.html. The coefficients set for PCA-L1 and KPCA-L1 are the same as in [27, 28]. All experiments were performed in MATLAB R2012a on a PC with an Intel Core i5 CPU, 4 GB of memory, and the Windows 7 operating system.
4.1. Experiments on UCI Datasets
The experiments are conducted on six datasets from the UCI repository: the Ionosphere dataset is a binary classification problem of whether the radar signal can describe the structure of free electrons in the ionosphere or not; the Letter dataset is to assign each black-and-white rectangular pixel display to one of the 26 capital letters in the English alphabet; the Pendigits dataset handles the recognition of pen-based handwritten digits; the Pima-Indians dataset constitutes a clinical problem of diabetes diagnosis in patients from clinical variables; the WDBC dataset is another clinical problem for the diagnosis of breast cancer in malignant or benign classes; and the Wine dataset is the result of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. Table 1 shows their details. In the subsequent experiments, we used only the simplest linear classifier. The maximum likelihood (ML) rule is used to select the bandwidth coefficient, as suggested in previous work.
KECA-L1 and the other methods are run 10 times on all the selected datasets with respect to different numbers of components. We use the overall classification accuracy (OA) to evaluate the classification performance of the different algorithms. OA is defined as the percentage of samples correctly assigned; larger values indicate better quality. Figure 1 presents the average OA curves obtained by the aforementioned algorithms for these six real datasets. It can be observed from Figure 1 that OKECA is superior to KECA, PCA-L1, and KPCA-L1 except on the Letter dataset. This is probably because DR performed by OKECA not only reveals the structure related to the most Renyi entropy of the original data but also considers the rotational invariance property. In addition, KECA-L1 outperforms all the other methods except OKECA. This may be attributed to the robustness of the L1-norm to outliers compared with that of the L2-norm. In Figure 1(c), OKECA seems to obtain nearly the same results as KECA-L1. However, the average running time (in hours) of OKECA on Pendigits is 37.384 times more than that of KECA-L1 (1.339).
4.2. Experiments on Software Projects
In software engineering, it is usually difficult to test a software project completely and thoroughly with the limited resources . Software defect prediction (SDP) may provide a relatively acceptable solution to this problem. It can allocate the limited test resources effectively by categorizing the software modules into two classes: nonfault-prone (NFP) or fault-prone (FP) according to 21 software metrics (Table 2).
This section aims to employ KECA-based methods to reduce the dimensions of the selected software data (Table 3) and then utilize the SSL-based classifier combined with the support vector machine to classify each software module as NFP or FP. The bandwidth coefficient is again selected by the ML rule. PCA-L1 and KPCA-L1 are included as a benchmarking yardstick. There are 34 groups of tests for each release in Table 3. The most suitable releases from different software projects are selected as training data. We evaluate the performance of the selected methods on SDP in terms of recall (R), precision (P), and F-measure (F) [35, 36]. The F-measure is defined as
$$F = \frac{2 \times P \times R}{P + R}, \tag{27}$$
where
$$R = \frac{TP}{TP + FN}, \qquad P = \frac{TP}{TP + FP}. \tag{28}$$
In (28), FN (i.e., false negative) means that buggy classes are wrongly classified as nonfaulty, while FP (i.e., false positive) means that nonbuggy classes are wrongly classified as faulty. TP (i.e., true positive) refers to correctly classified buggy classes. Values of recall, precision, and F-measure range from 0 to 1, and higher values indicate better classification results.
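As a small worked example of these three measures, the sketch below computes precision, recall, and F-measure from confusion counts. The function name `precision_recall_f` and the counts are hypothetical and are not taken from the experiments; empty denominators are mapped to 0.0 by convention.

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall, and F-measure from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_measure = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Hypothetical counts: 30 buggy modules found, 20 false alarms, 10 buggy modules missed.
p_demo, r_demo, f_demo = precision_recall_f(tp=30, fp=20, fn=10)
```

With these counts the precision is 30/50, the recall is 30/40, and the F-measure is their harmonic mean, which illustrates how a high recall can coexist with a moderate F-measure when false alarms are frequent.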
Figure 2 shows the results using box-plot analysis. From Figure 2, considering the minimum, maximum, median, first quartile, and third quartile of the boxes, we find that KECA-L1 generally performs better than the other methods. Specifically, KECA-L1 obtains acceptable results in the SDP experiments compared with previously proposed benchmarks, since the median values of the boxes with respect to R and F are close to 0.7 and more than 0.5, respectively. In contrast, none of KECA, OKECA, PCA-L1, and KPCA-L1 meets these criteria. Therefore, the results validate the robustness of KECA-L1.
This paper proposes a new extension to the OKECA approach for dimensionality reduction. The new method (i.e., KECA-L1) employs the L1-norm and a rotation matrix to maximize the information potential of the input data. To find the optimal entropic kernel components, motivated by Nie et al.’s algorithm, we design a nongreedy iterative process that converges much faster than OKECA. Moreover, a general semisupervised learning algorithm has been established for classification using KECA-L1. Compared with several recently proposed KECA- and PCA-based approaches, this SSL-based classifier markedly improves the performance on real-world data classification and software defect prediction.
Although KECA-L1 has achieved impressive success on real examples, several problems remain to be addressed in future research. The efficiency of KECA-L1 has to be improved, because it is still relatively time-consuming compared with most existing PCA-based methods. Additionally, KECA-L1 is expected to be applied to pattern analysis algorithms that were previously based on PCA approaches.
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
This work was supported by the National Natural Science Foundation of China (Grant no. 61702544) and Natural Science Foundation of Jiangsu Province of China (Grant no. BK20160769).
The MATLAB toolbox of KECA-L1 is available. (Supplementary Materials)
S. Mika, A. Smola, and M. Scholz, “Kernel PCA and de-noising in feature spaces,” Conference on Advances in Neural Information Processing Systems II, vol. 11, pp. 536–542, 1999.
Y. Ke and R. Sukthankar, “PCA-SIFT: a more distinctive representation for local image descriptors,” in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 506–513, Washington, DC, USA, June-July 2004.
M. Luo, F. Nie, X. Chang, Y. Yang, A. Hauptmann, and Q. Zheng, “Avoiding optimal mean robust PCA/2DPCA with non-greedy L1-norm maximization,” in Proceedings of International Joint Conference on Artificial Intelligence, pp. 1802–1808, New York, NY, USA, July 2016.
F. Nie, J. Yuan, and H. Huang, “Optimal mean robust principal component analysis,” in Proceedings of International Conference on Machine Learning, pp. 1062–1070, Beijing, China, June 2014.
F. Nie, H. Huang, C. Ding, D. Luo, and H. Wang, “Robust principal component analysis with non-greedy L1-norm maximization,” in Proceedings of International Joint Conference on Artificial Intelligence, pp. 1433–1438, Barcelona, Catalonia, Spain, July 2011.
R. Jenssen, “Kernel entropy component analysis: new theory and semi-supervised learning,” in Proceedings of IEEE International Workshop on Machine Learning for Signal Processing, pp. 1–6, Beijing, China, September 2011.
W. Krzanowski, Principles of Multivariate Analysis, vol. 23, Oxford University Press (OUP), Oxford, UK, 2000.