Abstract
In this paper, we describe a novel approach to sparse principal component analysis (SPCA) via a nonconvex sparsityinducing fraction penalty function SPCA (FPSPCA). Firstly, SPCA is reformulated as a fraction penalty regression problem model. Secondly, an algorithm corresponding to the model is proposed and the convergence of the algorithm is guaranteed. Finally, numerical experiments were carried out on a synthetic data set, and the experimental results show that the FPSPCA method is more adaptable and has a better performance in the tradeoff between sparsity and explainable variance than SPCA.
1. Introduction
Principal component analysis (PCA) [1] has become more popular in data analysis, dimension reduction, image processing, and feature extraction [2]. It seeks the linear combinations of the original variables such that the derived variables capture maximal variance of the total original variables. We can obtain PCA(s) of an original variable by the singular value decomposition (SVD) of the data matrix which is composed of the observed variables. Let be an matrix, where and are the number of observations and the number of variables, respectively. Suppose the average value of columns of are all zero and the singular value decomposition of is
Then, the columns of matrix are the principal components of , and the columns of are the corresponding loadings of the principal components. The PCA method is popular due to the following two properties: first, principal components capture the maximum variability from the columns of matrix , which guarantees minimal information loss; second, principal components are uncorrelated, so we can use one of them without considering the others.
At the same time, however, PCA has its own drawback, i.e., each principal component of matrix is a linear combination of all variables and the elements of the loading vectors (the columns of matrix ) are usually nonzero, which make them difficult to interpret the gained PCs.
Based on the abovementioned priority properties and drawback, some scholars consider it will be a wise option to keep the dimension reduction property at the same time to reduce the number of explainable variables. For this purpose, one method is to set the loading whose absolute value is smaller than a threshold to zero.
The same problem also arises in multiple linear regression, where the response variable is explained by a linear combination of the explainable variables. But, which are the important explainable variables that account for response variable information most? To answer this question, researchers have proposed several approaches, and the first type method is the lassobased approaches [3–6]. Others, for example, Efron et al. [7] proposed the LARS which is sensitive to noise of samples in 2004, and Friedman et al. [8] put forward a pathwise coordinate optimization approach in 2007. The second type is the lassobootstrap methods [9, 10]. Other methods including overview type articles [11–13] and Radchenko et al. [14] proposed a variable inclusion and shrinkage method, Fan and Li [15] proposed a nonconcave penalized likelihood method, and Candès and Tao [16] proposed a dantzig selector method. Among them, the lasso method [5] is a wellknown variable selection technique which produces an accurate and sparse model. As time went on, Zou and Hastie [4] proposed the elastic net in 2005, an adaptive method, which has the advantages of both lasso and ridge regression. The elastic net approach can be considered as a generalization of lasso and ridge regression. In 2006, Zou et al. [3] proposed sparse principal analysis (SPCA) using the lasso and ridge regression to produce a modified principal component with sparse loadings. They show that PCA can be formulated as a regressiontype optimization problem and sparse loadings are obtained by imposing norm and norm regularization on the regression coefficients of the model. In 2009, Leng and Wang [17] proposed a method of simple adaptive sparse principal component analysis, which uses the adaptive lasso penalty instead of the lasso penalty in SPCA. Moreover, they replace the leastsquares objective function in SPCA by a general leastsquares objective function, which leads to study many related SPCA estimators under a unified theoretical framework and get the method of general adaptive sparse principal component analysis. SPCA can produce a sparse model that has only fewer active coefficients, and the other coefficients are all set to zero. The proposed sparse model SPCA is more interpretable than PCA. In a word, SPCA can specifically identify structure information of the data matrix.
The idea of making the model’s coefficients sparse is not a new one. Jollife et al. [18] first proposed a SPCA method using norm regularization which leads to a variety of sparsityinducing methods. The success of SPCA in gaining more interpretable model motivates us to propose the method in this paper, where we proposed a fractional function penalty method with respect to sparse PCA. The method is to exploit the fractional function [19] to penalize the model coefficients in order to obtain sparse coefficients, which makes the PCA to have more interpretable ability.
The rest of the paper is organized as follows. In Section 2, definition and derivation of principal components are reviewed. The sparse principal component is presented in Section 3. In Section 4, we propose the fractional functional penalized principal components. Numerical experiments and conclusions are formulated in Sections 5 and 6, respectively.
In this paper, without specification, scalar is denoted as lower case letter , and vector is denoted as bold lower case letter . The matrix is denoted by bold capital letter , and denotes the identity matrix. The transpose of a real matrix is denoted by . The Frobenius norm of is denoted by , where denotes column vector or matrix, and the spectral norm of a matrix is denoted by .
2. Principal Components and Sparse Principal Components
2.1. Definition of Derivation of Principal Components
Let be a vector of p random variables, and the variances of the p random variables and the covariance between the p variables will be of importance. If p is large, an alternative key approach is to seek for fewer variables that preserve well most of the information provided by variances and covariances of these p variables. One of the alternative techniques is principle component analysis (PCA) which concerns more variance of variables than covariance. PCA first looks for a linear function owning to maximum variance of , where , that is,
Then, in the same way, we can find a linear function , in which and are linear independence, so that at the th step, a linear function can be found which has maximum variance and is linear independence with the former linear functions. The is the th principal component (PC) [1]. Many of the estimation problems in array signal processing can also start with the same issue in (2); however, there may be exist some perturbations [20–22]. Having known what are the PCs, the remaining question is how to find them. If the vector of random variables has a known covariance matrix whose th entry is the known covariance of random variables and when , and variance of the element of when . When is unknown, the more realistic method is to substitute for sample covariance matrix .
In fact, because is often unknown, we consider the case in the following only.
To obtain the PCs of the sampling matrix , where and are the number of observations and variables of random vector , respectively. Consider the coefficient vector , which is forced to maximize the sample variance of , where is normalized. Let denote the th column of sampling matrix , and is zerocentered, . The problem can be reformulated as the follows:
and solving (3), we can obtain the relationship ; furthermore, we havewhere is the largest eigenvalue of matrix and is the eigenvector of corresponding to . is called the 1th PC of sampling matrix .
In the same way, the th PC of is and the sample variance of is , where is the th largest eigenvalue of and is the corresponding eigenvector.
In general, if matrix is known, , the th column of , is zerocentered. The singular value decomposition of is . Then, the principal components of matrix , which are denoted by , are , where is the matrix composed of the first columns of the matrix and is the rank of . So, we can get [1, 3], where and are th columns of matrix and , respectively.
Although principal component analysis is widely used in various technical fields, it has an obvious drawback, that is, each principal component is a linear combination of all original variables, but it is often difficult to interpret the result. To solve this problem, Zou et al. [3] proposed sparse principal component analysis. They first consider PCA as a linear regression question.
2.2. Preparation of Sparse Principal Components
2.2.1. Linear Regression and LASSO
Consider matrix with observations of the variables, and let be the response vector and the predictor variables, where is the th column of matrix . Assume and have zero means. The question of evaluating of PCA of matrix can be considered as a linear regression problem:
This model is simple, and the result is easy to get in some cases; however, the result is difficult to interpret [3]. Based on model (5), Tibshirani [5] proposed the lasso method:where and .
Model (6) can produce accurate and sparse result at the same time, which strengths the result’s interpretable ability. However, the lasso method has some limitations, and the most relevant one is that the number of variables produced by the lasso is limited by the number of observations. Given , when the number of rows and number of columns of satisfy , the lasso can select explainable variables at most.
To overcome the abovementioned drawback, Zou and Hastie [4] proposed the elastic net model:where , , and .
The elastic net model shares the properties of lasso and ridge regression at the same time because when the elastic is the lasso, and when , it becomes the ridge regression. When , by choosing proper , the elastic net model can potentially include all variables.
2.2.2. Ridge Regression and the Naive Elastic Net
In model (6) and (7), the sparse coefficients are all norminduced and the sparsity is none of norm’s business. Jolliffe et al. [18] proposed the SCoTLASS method:where and is a parameter. This method has the ability to obtain sparse loadings by imposing norm penalty on principle component analysis (PCA).
The SCoTLASS method has two drawbacks. One is difficulty of choosing parameter and another problem is the high computational cost. So, Zou et al. [3] consider a modified method. They first show that principle component analysis (PCA) can be expressed as a ridge regression problem.
Suppose the singular value decomposition of is . Let and be the th column of matrix and the th element of diagonal matrix , respectively, and then, is the th principal component of matrix , . The ridge estimation of is given by the following expression:
Let , and then, , where and is the th column of .
This conclusion shows the relationship between principal component analysis and the regression method. The term in (9) is to guarantee the unique of when the inverse of does not exist. Furthermore, Zou and Hastie [4] proposed a method to add norm penalty to model (9) and got the following optimization problem:
They call the is an approximation of , and is the approximated principal component. To distinguish this model from model (7), model (10) is called the naive elastic net.
2.3. Sparse Principle Component Based on PCA
Based on the general conclusion in the end of Section 2.1, Zou et al. [3] proposed a twostage analysis: perform PCs first, and then, find suitable spare approximation of PCs.
Consider the first principal components of , where . Let , , andfor any . Then, , where is the estimation of .
Then, we can perform SPCA, which is based on the PCA and regression method, by adding norm penalty to the regression coefficients to induce sparse loadings. Thus, the following result is obtained:subject to . Where is the column of matrix and is the column of matrix , .
3. Fractional FunctionPenalized Sparse Principal Components
3.1. Fraction FunctionPenalized Method
Recently, Li et al. [19] proposed a fractionpenalized function. They study the problem of a nonconvex sparsity promoting penalty function,to relax the norm in compressed sensing [15, 16, 23, 24] to get the sparsity signal, where is a parameter and . is a fraction function which is nonconvex and sparse promoting. They studied the properties of this fractional function and derived a closed form expression of the optimal solution to the fractional function penalty problem and proposed an iterative thresholding algorithm to solve the fraction function penalty problem. A series of experiments have been conducted and the results show, compared with the soft thresholding algorithm and the half thresholding algorithms, that the proposed algorithm has the property of better sparse signal recovery with and without measurement noise.
Based on the good performance of the fraction function in sparse promotion in compressed sensing [23, 25], in this paper, we shall adopt this method to the sparse principal component analysis (SPCA), i.e., replace the norm penalty in SPCA with the fraction function penalty.
Combining fd12(12) and (13)fd13, we can get
We now introduce the following fact for the purpose of solving the problem (14).
Owing to , let be the matrix satisfying a orthonormal matrix, which is a block matrix. Let . Then,whileso
So, problem (14) can be transformed into
Problem (19) can be solved by an alternating algorithm. If is given, can be estimated byand if is given, problem (14) is to solvewhere is the th column of matrix . According to the procrustes rotation [26–28], the estimation of can be obtained; suppose the SVD of matrix is , by [3], that is, .
3.2. Transformation of Problem (16)
In (20), let be given by for , where is the th column of the matrix in the which is the singular decomposition of the matrix . Let , and (22) can be reformulated as
Let and . Then, (22) can be rewritten as
The relationship of and can be explained as the following. Let
According to the expressions of and the definition of , we can get
Owing to , it is easy to obtain that , so thatthat is,
Both and are continuous functions with respect to and , respectively, so as , we have , and further, we get . That is, as .
So, we conclude that can be approximated well by when positive is small enough.
Remark 1. The reason why we substitute with is model (24) is easier to solve than model (22).
Now, we will show that the optimal solution to the minimization problem,can be expressed as a thresholding operation.
3.3. The Algorithm of the Problem (29)
For any , and , let
We have the following results.
Lemma 1. For any fixed parameter and , if is a local minimizer to , thenwhere ,and according to [19],
Proof. We can see that can be expressed asthat is to say, minimizing for , given , , and , is equivalent toSo, is a local minimizer of if and only if solves the following problem:According to Lemma 10 in [19], we complete the proof.
Furthermore, we can get Theorem 1.
Theorem 1. If is an optimal solution to (29), and is positive value and satisfies , and then, the optimal solution can be given bywhere ,
Proof. According to , we haveThat is to say, for , is a local minimizer of as long as is a solution to (29). According to Lemma 10 in [19] and Lemma 1, we complete the proof.
With the expression of in Theorem 1, we can get an alternating thresholding algorithm for problem (29), that is,for , where and is the thresholding operator in Lemma 1. We call this algorithm the alternating iterative FPSPCA thresholding algorithm.
Remark 2. The dot in may be a real number as in Lemma 1 or a vector whose components are all real numbers. If the dot represents a vector, means the result vector which components are the result that acts on every component of the vector sequentially.
It is well known that the solution to problem (29) depends on the choice of the regularization parameter . Before we give out the proper choice for the regularization parameter , we need the following definition.
Definition 1. The nonincreasing rearrangement of the vector is the vector for whichand there is a permutation with for all , and is the th component of the vector .
Similar to the choice of the regularization parameter in FP algorithm (see Scheme 2 in [19]), here, we can approximate by using the current step value and the can be chosen aswhere is a small positive number, such as 0.1, 0.01, or 0.001, and is sparsity degree of . By this way, the iterative algorithm is adaptive and free from the regular parameter choice. Noticing that in the (2) of Theorem 2, we take , where .
With abovementioned parameterchoosing method, according to the iterative method of (40), we have the following alternating iterative FPSPCA thresholding algorithm. If is given, for each , is the same as in formula (24), by using the iterative formula (40), is obtained. If is fixed, according to (21), it is only try to minimize = with respect to , where . According to the reduced rank form of the Procrustes rotation in [26], if the singular value decomposition of matrix is , the solution is [3]. The alternating iterative FPSPCA thresholding algorithm is described in Algorithm 1.
At last, we discuss the convergence of the iterative FPSPCA thresholding algorithm which converges to a stationary point of the iteration (40).

Theorem 2. Let be the sequence generated by the iterative FPSPCA thresholding algorithm (40) with . Then,(i)The sequence is monotonically decreasing(ii) is asymptotically regular, i.e., (iii) converges to a stationary point of the iteration (40)
Proof. The proof is similar to the proof of Theorem 4.1 in [29] and Lemma 2 in [30], so it is omitted.
4. Numerical Experiments
In this section, we present numerical results of the FPSPCA method and compare the result with the SPCA method [3]. The simulations are conducted on a personal computer (Intel(R) core(TM) i76700 [email protected] 3.40GHZ, RAM 16.0 GB) with MATLAB R2014a programming platform.
4.1. Experimental Preparation
In order to compare with the SPCA method, we use the synthetic data as in [3]. The synthetic data has three hidden factors,where , , and are independent.
The 10 observable variables are generated as follows:where are independent, and . We use the matrix to perform PCA, SPCA, and FPSPCA.
The three hidden factors have three different variance 290, 300, and 283.7875. The number of variables with respect to the three factors are 4, 4, and 2, so and are nearly equally important, and they are more important than . These facts suggest that we need to consider the relatively important variables with proper sparse representations only. The two derived variables should recover the factors and well using and accordingly.
The SPCA and FPSPCA methods were carried out by the method proposed by Zou et al. [3] and the fractional penalty method proposed in [19]. We compared the performance of the two methods using the synthetic data matrix and summarize the results in two cases below.
4.2. Parameter Selection and the Result of Experiments
Case 1. In this situation, the method proposed by Zou et al. [3] which is called SPCA and our proposed method which is called FPSPCA have the same performance for the almost equally important hidden factors. That is, for most of the generated data matrices, the SPCA and FPSPCA have the same loadings but different adjusted variance.
Table 1 reports a numerical result of PCA, SPCA, and FPSPCA and summarizes the results in loadings and adjusted variance. The parameters are in SPCA and in FPSPCA. We can see, from Table 1, that both SPCA and FPSPCA have same orthogonal loadings of PC1 and PC2 after normalization, but the adjusted variance obtained from FPSPCA is bigger than that from SPCA. That is to say, PC1 and PC2 obtained by FPSPCA method have more information than PC1 and PC2 by SPCA. Numerical simulations using other generated data matrices have the same result also.
Case 2. For a part of the generated matrices, according to our experimental results, the FPSPCA does work, but SPCA does not work well. That is, by adjusting parameters and in the SPCA method, we cannot get the sparse loadings of PC1 or PC2. The parameters and the corresponding numerical results are reported in Table 2, where capital letter N represents no sparse loadings of PC1 and PC2 have been obtained when parameters are the value above the N and the value(s) on the lefthand side of the N. NaN means not a number when parameters acquire the value(s) in the same way as capital letter N in Table 2. Also, we can see, from Table 2, that the loadings of PC1 or PC2 show a meaningless result when is between 3.5 and 4.0. Numerical simulations show the same meaningless result when is increasing by a large degree. We give, in Table 3, a result of loadings of PC1 and PC2 when using the SPCA method and the result of the FPSPCA method when . In a word, the FPSPCA method can find the sparse loadings of PC1 and PC2 while the SPCA method is at least difficult to obtain the sparse result in this case.
From Case 1 and Case 2, it can be seen that the FPSPCA method is more adaptable than the SPCA method. The reason why the FPSPCA method is more preferable is that it has a parameter in the penalty function which makes FPSPCA more flexible and easier to adjust to get the desirable result.
5. Conclusions
In this paper, the FPSPCA method is proposed based on the SPCA method [3, 4], where a fractional penalty function [19] is proposed to replace the norm in the SPCA method. Numerical simulations show that the proposed FPSPCA method is more adaptable and flexible than the SPCA method. However, the FPSPCA has its limitation, and during the numerical simulations, we find a few generated matrices that FPSPCA fails to produce sparse loadings of principal components and the SPCA method does not work also. How to deal with this problem and how to compare the method we proposed with other existing penalty methods are remaining questions that will be worth investigating further.
Data Availability
The data used to support the findings of this study are available from the corresponding author on request.
Conflicts of Interest
The authors declare that there are no known conflicts of interest associated with this publication.
Acknowledgments
This work was supported by the National Natural Science Foundation of China under Grant nos. 11771347, 11871392, 91730306, and 41390454.