Abstract
Kernel entropy component analysis (KECA) is a recently proposed dimensionality reduction (DR) method that has shown superiority in many pattern analysis problems previously solved by principal component analysis (PCA). The optimized KECA (OKECA) is a state-of-the-art variant of KECA that returns projections retaining more expressive power than KECA. However, OKECA is sensitive to outliers and has high computational complexity owing to the inherent properties of the L2-norm. To handle these two problems, we develop a new extension to KECA, namely, KECA-L1, for DR or feature extraction. KECA-L1 aims to find a more robust kernel decomposition matrix such that the extracted features retain as much information potential as possible, as measured by the L1-norm. Accordingly, we design a nongreedy iterative algorithm that converges much faster than OKECA. Moreover, a general semisupervised classifier is developed for KECA-based methods and applied to data classification. Extensive experiments on data classification and software defect prediction demonstrate that our new method is superior to most existing KECA- and PCA-based approaches. Code has also been made publicly available.
1. Introduction
The curse of dimensionality is one of the major issues in machine learning and pattern recognition [1]. It has motivated many scholars from different areas to implement dimensionality reduction (DR) to simplify the input space without degrading the performance of learning algorithms. Various efficient DR methods have been developed, such as independent component analysis (ICA) [2], linear discriminant analysis [3], principal component analysis (PCA) [4], and projection pursuit [5], to name a few. Among these algorithms, PCA has been one of the most widely used techniques for feature extraction (or DR). PCA implements a linear data transformation according to the projection matrix, which aims to maximize the second-order statistics of the input datasets [6]. To extend PCA to nonlinear spaces, Schölkopf et al. [7] proposed kernel PCA, the so-called KPCA method. The key of KPCA is to find the nonlinear relation between the input data and the kernel feature space (KFS) using the kernel matrix, which is derived from a positive semidefinite kernel function that computes inner products. Both PCA and KPCA perform data transformation by selecting the eigenvectors corresponding to the top eigenvalues of the projection matrix and the kernel matrix, respectively. Both methods (including their variants) have experienced great success in different areas [8–12], such as image reconstruction [13], face recognition [14–17], and image processing [18, 19]. However, as suggested by Zhang and Hancock [20], DR should be performed from the perspective of information theory to obtain more acceptable results.
To improve the performance of the aforementioned approaches to DR, Jenssen [6] developed a new and completely different data transformation algorithm, namely, kernel entropy component analysis (KECA). The main difference between KECA and PCA or KPCA is that the optimal eigenvectors (also called entropic components) derived from KECA preserve the most Renyi entropy of the input data instead of being associated with the top eigenvalues. The procedure of selecting the eigenvectors related to the Renyi entropy of the input space starts with a Parzen window kernel-based estimator [21]. Then, only the eigenvectors contributing the most entropy of the input datasets are selected to perform DR. This distinguishing characteristic helps KECA achieve better performance than the classical PCA and KPCA in face recognition and clustering [6]. More recently, Izquierdo-Verdiguier et al. [21] employed the rotation matrix from ICA [2] to optimize KECA and proposed the optimized KECA (OKECA). OKECA not only shows superiority in the classification of both synthetic and real datasets but also obtains acceptable kernel density estimation (KDE) using far fewer entropic components (just one or two) than KECA [21]. However, OKECA is sensitive to outliers because of the inherent properties of the L2-norm. In other words, if the input data follow a normal distribution and are contaminated by non-normally distributed outliers, the DR performance of OKECA may degrade. Additionally, OKECA is very time-consuming when handling large-scale input datasets (Section 4).
Therefore, the main purpose of this paper is to propose a new variant of KECA that reduces the sensitivity to outliers and improves the efficiency of OKECA. The L1-norm is well known for its robustness to outliers [22]. Additionally, Nie et al. [23] established a fast iteration process to handle the general L1-norm maximization problem with a nongreedy algorithm. Hence, we take advantage of OKECA and propose a new L1-norm version of KECA (denoted as KECA-L1). KECA-L1 uses an efficient convergence procedure, motivated by Nie et al.’s method [23], to search for the entropic components contributing the most Renyi entropy of the input data. To evaluate the efficiency and effectiveness of KECA-L1, we design and conduct a series of experiments, in which the data vary from single class to multi-attribute and from small to large size. The classical KECA and OKECA are also included for comparison.
The remainder of this paper is organized as follows: Section 2 reviews the general L1-norm maximization problem, KECA, and OKECA. Section 3 presents KECA with nongreedy L1-norm maximization and the semisupervised-learning-based classifier. Section 4 validates the performance of the new method on different datasets. Section 5 concludes the paper.
2. Preliminaries
2.1. An Efficient Algorithm for Solving the General L1-Norm Maximization Problem
The general L1-norm maximization problem was first raised by Nie et al. [23]. This problem, based on the hypothesis that there exists an upper bound for the objective function, can be generally formulated as [23]
$$\max_{v \in C} \; f(v) + \sum_{i} \left| g_i(v) \right|, \tag{1}$$
where both $f(v)$ and $g_i(v)$ for each $i$ denote arbitrary functions, and $v \in C$ represents an arbitrary constraint.
Then a sign function is defined as
$$\operatorname{sgn}(x) = \begin{cases} 1, & x > 0, \\ 0, & x = 0, \\ -1, & x < 0, \end{cases} \tag{2}$$
and employed to transform the maximization problem (1) as follows:
$$\max_{v \in C} \; f(v) + \sum_{i} \alpha_i g_i(v), \tag{3}$$
where $\alpha_i = \operatorname{sgn}(g_i(v))$. Nie et al. [23] proposed a fast iteration process to solve problem (3), which is shown in Algorithm 1. It can be seen from Algorithm 1 that $\alpha_i$ is determined by the current solution $v^t$, and the next solution $v^{t+1}$ is updated according to the current $\alpha_i$. The iterative process is repeated until the procedure converges [23, 24]. The convergence of Algorithm 1 has been demonstrated, and the associated details can be found in [23].
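To make the alternating structure of this iteration concrete, the following sketch applies it to one simple instance chosen here purely for illustration: no smooth term (f(v) = 0), linear terms g_i(v) = a_i^T v, and the unit-sphere constraint ||v||_2 = 1. The function name `nongreedy_l1_unit_vector` and this particular choice of f, g_i, and C are assumptions, not part of the original algorithm listing. For this instance, the inner maximization has a closed form: the maximizer of a linear function on the unit sphere is the normalized coefficient vector.

```python
import numpy as np

def nongreedy_l1_unit_vector(A, n_iter=100, seed=0):
    """Maximize sum_i |a_i^T v| subject to ||v||_2 = 1 (rows of A are a_i).

    Illustrative instance of the sign-then-update iteration:
      1) alpha_i = sgn(g_i(v)) at the current solution,
      2) v <- argmax of the linearized objective sum_i alpha_i * g_i(v).
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[1])
    v /= np.linalg.norm(v)
    history = [np.abs(A @ v).sum()]           # objective value per iteration
    for _ in range(n_iter):
        alpha = np.sign(A @ v)                # step 1: signs at current v
        m = A.T @ alpha                       # sum_i alpha_i * a_i
        norm_m = np.linalg.norm(m)
        if norm_m == 0.0:                     # degenerate case: nothing to update
            break
        v_new = m / norm_m                    # step 2: maximizer on the unit sphere
        history.append(np.abs(A @ v_new).sum())
        converged = np.allclose(v_new, v)
        v = v_new
        if converged:
            break
    return v, history
```

The recorded objective values never decrease from one iteration to the next, which is the property the convergence argument in [23] relies on.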

2.2. Kernel Entropy Component Analysis
KECA is characterized by its entropic components instead of the principal or variance-based components in PCA or KPCA, respectively. Hence, we first describe the concept of the Renyi quadratic entropy. Given the input dataset $X = [x_1, \ldots, x_N]$, the Renyi entropy of $X$ is defined as [6]
$$H(p) = -\log \int p^2(x)\,dx, \tag{4}$$
where $p(x)$ is a probability density function. Based on the monotonic property of the logarithmic function, Equation (4) can be rewritten as
$$V(p) = \int p^2(x)\,dx. \tag{5}$$
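We can estimate Equation (5) using the kernel $k_\sigma$ of the Parzen window density estimator determined by the bandwidth coefficient $\sigma$ [6] such that
$$\hat{V}(p) = \frac{1}{N^2} \mathbf{1}^T K \mathbf{1}, \tag{6}$$
where $K_{ij} = k_\sigma(x_i, x_j)$ constitutes the $N \times N$ kernel matrix and $\mathbf{1}$ represents an $N$-dimensional vector containing all ones. With the help of the kernel decomposition [6],
$$K = E D E^T, \tag{7}$$
Equation (6) is transformed as follows:
$$\hat{V}(p) = \frac{1}{N^2} \sum_{i=1}^{N} \left( \sqrt{\lambda_i}\, e_i^T \mathbf{1} \right)^2, \tag{8}$$
where the diagonal matrix $D$ and the matrix $E$ consist of the eigenvalues $\lambda_1, \ldots, \lambda_N$ and the corresponding eigenvectors $e_1, \ldots, e_N$, respectively. It can be observed from Equation (8) that the entropy estimator consists of projections onto all the KFS axes because
$$k_\sigma(x_i, x_j) = \phi(x_i)^T \phi(x_j), \tag{9}$$
where the function of $\phi(\cdot)$ is to map the two samples $x_i$ and $x_j$ into the KFS. Additionally, only an entropic component meeting the criteria of $\lambda_i \neq 0$ and $e_i^T \mathbf{1} \neq 0$ can contribute to the entropy estimate [21]. In short, KECA implements DR by projecting the mapped data onto a subspace spanned not by the eigenvectors associated with the top eigenvalues but by the entropic components contributing most to the Renyi entropy estimator [25].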
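The selection rule described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the MATLAB implementation referenced in this paper: it assumes a Gaussian kernel without its normalizing constant, and the function names `rbf_kernel` and `keca` are hypothetical. The point it shows is that eigenpairs are ranked by their entropy contribution, eigenvalue times the squared sum of the eigenvector entries, divided by N squared, and not by eigenvalue alone.

```python
import numpy as np

def rbf_kernel(X, sigma):
    """Gaussian kernel matrix (normalizing constant omitted, an assumption)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))

def keca(X, sigma, n_components):
    """Rank eigenpairs of K by entropy contribution and project the data."""
    K = rbf_kernel(X, sigma)
    N = K.shape[0]
    eigvals, eigvecs = np.linalg.eigh(K)        # K = E D E^T for symmetric K
    eigvals = np.clip(eigvals, 0.0, None)       # remove tiny negative round-off
    # Contribution of each eigenpair: (sqrt(lambda_i) * e_i^T 1)^2 / N^2
    contrib = eigvals * eigvecs.sum(axis=0) ** 2 / N ** 2
    order = np.argsort(contrib)[::-1][:n_components]
    # Rows of the result are the transformed samples: E_k D_k^{1/2}
    projection = eigvecs[:, order] * np.sqrt(eigvals[order])
    return projection, contrib, K
```

Summing all contributions recovers the quadratic form of the kernel matrix with the all-ones vector divided by N squared, which gives a quick sanity check on the decomposition.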
2.3. Optimized Kernel Entropy Component Analysis
Because KECA is sensitive to different bandwidth coefficients [21], OKECA was proposed to fill this gap and improve the performance of KECA on DR. Motivated by the fast ICA method [2], an extra rotation matrix (applying ) is employed to the kernel decomposition (Equation (7)) in KECA for maximizing the information potential (the entropy values in Equation (8)) [21]: where is the L2-norm and denotes a column vector () in . Izquierdo-Verdiguier et al. [21] utilized a gradient-ascent approach to handle the maximization problem (10): where is the step size. can be obtained by Lagrangian multiplier:
The entropic components multiplied by the rotation matrix can retain more (or equal) information potential than those of KECA, even with fewer components [21]. Moreover, OKECA is robust to the bandwidth coefficient. However, OKECA has two main limitations. First, the new entropic components derived from OKECA are sensitive to outliers because of the inherent properties of the L2-norm (Equation (10)). Second, although a very simple stopping criterion is designed to avoid additional iterations, OKECA still has high computational complexity: its computational cost is [21], where is the number of iterations for finding the optimal rotation matrix, whereas that of KECA is [21].
3. KECA with Nongreedy L1-Norm Maximization
3.1. Algorithm
To alleviate the problems existing in OKECA, this section presents how to extend KECA to its nongreedy L1-norm version. For ease of understanding, the definition of the L1-norm is first introduced as follows:
Definition 1. Given an arbitrary vector $x \in \mathbb{R}^{n}$, the L1-norm of the vector $x$ is
$$\|x\|_1 = \sum_{j=1}^{n} |x_j|, \tag{13}$$
where $\|\cdot\|_1$ is the L1-norm and $x_j$ denotes the jth element of $x$.
Then, motivated by OKECA, we attempt to develop a new objective function to maximize the information potential (Equations (8) and (10)) based on the L1-norm: where , is the size of samples. The rotation matrix is denoted as , where and are the dimension of the input data and the dimension of the selected entropic components (or number of projections), respectively. It is difficult to directly solve problem (14), but we may regard it as a special case of problem (1) when . Therefore, Algorithm 1 can be employed to solve (14). Next, we show the details of how to find the optimal solution of problem (14) based on the proposal from References [23, 24]. Let
Thus, problem (14) can be simplified as
By singular value decomposition (SVD), thenwhere , , and . Then we obtainwhere , and denote the element of matrix and , respectively. Due to the property of SVD, we have . Additionally, is an orthonormal matrix [23] such that . Therefore, can reach the maximum only if , where denotes the identity matrix, and is a matrix of zeros. Considering that , thus the solution to problem (16) is
Algorithm 2 (a MATLAB implementation of the algorithm is available in the Supporting Document for interested readers) shows how to utilize the nongreedy L1-norm maximization described in Algorithm 1 to compute Equation (19). Since problem (16) is a special case of problem (1), it follows from Theorem 2 in Reference [23] that the optimal solution to Equation (19) is a local maximum point for . Moreover, Phase 1 of Algorithm 2 spends on the eigen decomposition. Thus, the total computational cost of KECA-L1 is , where is the number of iterations for convergence. Considering that the computational complexity of OKECA is , we can safely conclude that KECA-L1 converges much faster than OKECA.
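The SVD-based update behind this kind of procedure can be illustrated with a short NumPy sketch. It is written under explicit assumptions and is not the authors' Algorithm 2: the objective is taken to be the sum over samples of the L1-norm of W^T a_i subject to W^T W = I, in the form studied in [23], and the matrix `A` (whose columns play the role of the vectors a_i), the function names, and the stopping tolerance are all hypothetical. Each pass fixes the signs at the current W, forms the matrix M as the sum of a_i times the transposed sign vector, and replaces W by the product of the thin SVD factors of M, which coincides with U [I; 0] V^T for the full SVD.

```python
import numpy as np

def l1_objective(A, W):
    """sum_i ||W^T a_i||_1, where the columns of A are the vectors a_i."""
    return np.abs(A.T @ W).sum()

def nongreedy_l1_rotation(A, p, n_iter=200, tol=1e-10, seed=0):
    """Nongreedy L1-norm maximization over matrices with orthonormal columns.

    A: (m, N) array of hypothetical data vectors; p: number of projections.
    """
    m = A.shape[0]
    rng = np.random.default_rng(seed)
    W, _ = np.linalg.qr(rng.standard_normal((m, p)))    # orthonormal start
    history = [l1_objective(A, W)]
    for _ in range(n_iter):
        alpha = np.sign(A.T @ W)                # (N, p): row i is sgn(W^T a_i)
        M = A @ alpha                           # (m, p): sum_i a_i alpha_i^T
        U, _, Vt = np.linalg.svd(M, full_matrices=False)
        W = U @ Vt                              # maximizer of Tr(W^T M), W^T W = I
        history.append(l1_objective(A, W))
        if history[-1] - history[-2] <= tol * max(1.0, history[-2]):
            break
    return W, history
```

Only one thin SVD of an m-by-p matrix is needed per iteration, and all p projection directions are updated jointly rather than one at a time, which is what "nongreedy" refers to.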

3.2. The Convergence Analysis
This subsection demonstrates the convergence of Algorithm 2 in the following theorem:
Theorem 1. The above KECA-L1 procedure converges.
Proof. Motivated by References [23, 24], we first show that the objective function (14) of KECA-L1 monotonically increases in each iteration . Let and , then (14) can be simplified to Obviously, is parallel to , but neither is . Therefore, Considering that , thus Substituting (22) in (21), it can be obtained According to Step 3 in Algorithm 2 and the theory of SVD, for each iteration , we have Combining (23) and (24) for every , we have which means that the objective of Algorithm 2 is monotonically increasing. Additionally, considering that the objective function (14) of KECA-L1 has an upper bound within the limited iterations, the KECA-L1 procedure will converge.
3.3. The Semisupervised Classifier
Jenssen [26] established a semisupervised learning (SSL) algorithm for classification using KECA. This SSL-based classifier is trained on both labeled and unlabeled data to build the kernel matrix such that it can map the data to the KFS appropriately [26]. Additionally, it is based on a general modelling scheme and is applicable to other variants of KECA, such as OKECA and KECA-L1.
More specifically, we are given pairs of training data with samples and the associated labels . In addition, there are unlabeled data points for testing. Let and denote the testing data and training data without labels, respectively; thus, we can obtain an overall matrix . Then we construct the kernel matrix derived from using (6), , which serves as the input of Algorithm 2. After the iteration procedure of nongreedy L1-norm maximization, we obtain a projection of onto orthogonal axes, where and . In other words, and are the low-dimensional representations of each testing data point and the training one , respectively. Assume that is an arbitrary data point to be tested. If it satisfies then is assigned to the same class as the jth data point of .
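The final assignment step, in which each projected test point takes the label of a matching projected training point, can be sketched as follows. The exact matching criterion is not reproduced in the text above, so this sketch assumes the smallest Euclidean distance in the projected space; that choice, the function name `assign_labels`, and the argument names are all assumptions made for illustration.

```python
import numpy as np

def assign_labels(train_proj, train_labels, test_proj):
    """Give each test point the label of its closest training point.

    train_proj: (n_train, d) low-dimensional training representations.
    test_proj:  (n_test, d) low-dimensional test representations.
    Closeness is squared Euclidean distance (an assumption, see lead-in).
    """
    train_proj = np.asarray(train_proj, dtype=float)
    test_proj = np.asarray(test_proj, dtype=float)
    diff = test_proj[:, None, :] - train_proj[None, :, :]
    d2 = (diff ** 2).sum(axis=2)              # (n_test, n_train) distances
    nearest = d2.argmin(axis=1)               # index j of the closest training point
    return np.asarray(train_labels)[nearest]
```

Because the kernel matrix is built from the training and test data together, both sets of representations come from the same projection, so distances between them are directly comparable.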
4. Experiments
This section shows the performance of the proposed KECA-L1 compared with the classical KECA [6] and OKECA [21] for real-world data classification using the SSL-based classifier described in Section 3.3. Several recent techniques such as PCA-L1 [27] and KPCA-L1 [28] are also included for comparison. The rationale for selecting these methods is that previous DR studies found that they can produce impressive results [27–29]. We conduct the experiments on a wide range of real-world datasets: (1) six different datasets from the University of California, Irvine (UCI) Machine Learning Repository (available at http://archive.ics.uci.edu/ml/datasets.html) and (2) 9 different software projects with 34 releases from the PROMISE data repository (available at http://openscience.us/repo). The MATLAB source code for running KECA and OKECA, uploaded by Izquierdo-Verdiguier et al. [21], is available at http://isp.uv.es/soft_feature.html. The coefficients set for PCA-L1 and KPCA-L1 are the same as in [27, 28]. All experiments are performed in MATLAB R2012a on a PC with an Intel Core i5 CPU, 4 GB of memory, and the Windows 7 operating system.
4.1. Experiments on UCI Datasets
The experiments are conducted on six datasets from the UCI repository: the Ionosphere dataset is a binary classification problem of whether the radar signal can describe the structure of free electrons in the ionosphere or not; the Letter dataset is to assign each black-and-white rectangular pixel display to one of the 26 capital letters in the English alphabet; the Pendigits dataset handles the recognition of pen-based handwritten digits; the Pima-Indians dataset constitutes a clinical problem of diabetes diagnosis in patients from clinical variables; the WDBC dataset is another clinical problem, the diagnosis of breast cancer as malignant or benign; and the Wine dataset is the result of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. Table 1 shows their details. In the subsequent experiments, we utilize only the simplest linear classifier [30]. The maximum likelihood (ML) rule [31] is used to select the bandwidth coefficient, as suggested in [21].
KECA-L1 and the other methods are run 10 times on all the selected datasets with respect to different numbers of components. We use the overall classification accuracy (OA) to evaluate the classification performance of the different algorithms. OA is defined as the percentage of samples correctly assigned; larger values indicate better quality. Figure 1 presents the average OA curves obtained by the aforementioned algorithms for these six real datasets. It can be observed from Figure 1 that OKECA is superior to KECA, PCA-L1, and KPCA-L1 except on the Letter dataset. This is probably because DR performed by OKECA not only reveals the structure related to the most Renyi entropy of the original data but also considers the rotational invariance property [21]. In addition, KECA-L1 outperforms the other methods besides OKECA. This may be attributed to the robustness of the L1-norm to outliers compared with that of the L2-norm. In Figure 1(c), OKECA seems to obtain nearly the same results as KECA-L1. However, the average running time (in hours) of OKECA on Pendigits is 37.384, compared with 1.339 for KECA-L1.
Figure 1: Average OA curves obtained by the compared algorithms on the six UCI datasets, panels (a)-(f).
4.2. Experiments on Software Projects
In software engineering, it is usually difficult to test a software project completely and thoroughly with limited resources [32]. Software defect prediction (SDP) may provide a relatively acceptable solution to this problem. It can allocate the limited test resources effectively by categorizing the software modules into two classes, non-fault-prone (NFP) or fault-prone (FP), according to 21 software metrics (Table 2).
This section aims to employ KECA-based methods to reduce the dimensionality of the selected software data (Table 3) and then utilize the SSL-based classifier combined with the support vector machine [33] to classify each software module as NFP or FP. The bandwidth coefficient is again set according to the ML rule. PCA-L1 and KPCA-L1 are included as benchmarks. There are 34 groups of tests, one for each release in Table 3. The most suitable releases [34] from different software projects are selected as training data. We evaluate the performance of the selected methods on SDP in terms of recall (R), precision (P), and F-measure (F) [35, 36]. The F-measure is defined as
$$F = \frac{2 \times P \times R}{P + R}, \tag{27}$$
where
$$R = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \qquad P = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}. \tag{28}$$
In (28), FN (i.e., false negative) means that buggy classes are wrongly classified as nonfaulty, while FP (i.e., false positive) means that nonbuggy classes are wrongly classified as faulty. TP (i.e., true positive) refers to correctly classified buggy classes [34]. The values of recall, precision, and F-measure range from 0 to 1, and higher values indicate better classification results.
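As a small illustration of the three measures, the sketch below computes recall, precision, and F-measure from confusion counts using the standard definitions: recall is TP / (TP + FN), precision is TP / (TP + FP), and the F-measure is their harmonic mean. The function name `precision_recall_f` and the convention of returning 0 when a denominator is zero are assumptions made here for illustration.

```python
def precision_recall_f(tp, fp, fn):
    """Return (recall, precision, f_measure) from confusion counts.

    tp: buggy modules correctly flagged as faulty
    fp: nonbuggy modules wrongly flagged as faulty
    fn: buggy modules wrongly flagged as nonfaulty
    Undefined ratios (zero denominator) are reported as 0.0 by convention.
    """
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + recall
    f_measure = 2.0 * precision * recall / denom if denom else 0.0
    return recall, precision, f_measure
```

For example, with 30 true positives, 10 false positives, and 20 false negatives, recall is 0.6, precision is 0.75, and the F-measure is 2/3.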
Figure 2 shows the results using boxplot analysis. From Figure 2, considering the minimum, maximum, median, first quartile, and third quartile of the boxes, we find that KECA-L1 performs better than the other methods in general. Specifically, KECA-L1 obtains acceptable results in the SDP experiments according to the benchmarks proposed in Reference [34], since the median values of the boxes with respect to R and F are close to 0.7 and more than 0.5, respectively. By contrast, KECA, OKECA, PCA-L1, and KPCA-L1 cannot meet these criteria. Therefore, the results validate the robustness of KECA-L1.
5. Conclusions
This paper proposes a new extension to the OKECA approach for dimensionality reduction. The new method (i.e., KECA-L1) employs the L1-norm and a rotation matrix to maximize the information potential of the input data. To find the optimal entropic kernel components, motivated by Nie et al.’s algorithm [23], we design a nongreedy iterative process that converges much faster than OKECA. Moreover, a general semisupervised learning algorithm has been established for classification using KECA-L1. Compared with several recently proposed KECA- and PCA-based approaches, this SSL-based classifier can remarkably improve the performance on real-world dataset classification and software defect prediction.
Although KECA-L1 has achieved impressive success on real examples, several problems remain to be solved in future research. The efficiency of KECA-L1 has to be optimized, because it is relatively time-consuming compared with most existing PCA-based methods. Additionally, KECA-L1 is expected to be applied in pattern analysis algorithms previously based on PCA approaches.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (Grant no. 61702544) and Natural Science Foundation of Jiangsu Province of China (Grant no. BK20160769).
Supplementary Materials
The MATLAB toolbox of KECA-L1 is available. (Supplementary Materials)