Abstract

In this paper, we describe a novel approach to sparse principal component analysis (SPCA) based on a nonconvex sparsity-inducing fraction function penalty (FP-SPCA). First, SPCA is reformulated as a fraction function-penalized regression model. Second, an algorithm for solving the model is proposed and its convergence is guaranteed. Finally, numerical experiments are carried out on a synthetic data set, and the results show that the FP-SPCA method is more adaptable than SPCA and achieves a better tradeoff between sparsity and explained variance.

1. Introduction

Principal component analysis (PCA) [1] is widely used in data analysis, dimension reduction, image processing, and feature extraction [2]. It seeks linear combinations of the original variables such that the derived variables capture the maximal variance of the original variables. The principal components can be obtained from the singular value decomposition (SVD) of the data matrix composed of the observed variables. Let $\mathbf{X}$ be an $n \times p$ matrix, where $n$ and $p$ are the number of observations and the number of variables, respectively. Suppose the columns of $\mathbf{X}$ all have zero mean and the singular value decomposition of $\mathbf{X}$ is
$$\mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}^{T}.$$

Then, the columns of the matrix $\mathbf{U}\mathbf{D}$ are the principal components of $\mathbf{X}$, and the columns of $\mathbf{V}$ are the corresponding loadings of the principal components. The PCA method is popular owing to two properties: first, the principal components capture the maximum variability among the columns of $\mathbf{X}$, which guarantees minimal information loss; second, the principal components are uncorrelated, so one of them can be used without considering the others.
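To make the SVD-based construction concrete, the following short Python sketch (assuming NumPy; the function and variable names are illustrative, not from the paper) computes the principal components $\mathbf{Z} = \mathbf{U}\mathbf{D}$ and the loadings $\mathbf{V}$ of a column-centered data matrix.

```python
import numpy as np

def pca_via_svd(X):
    """Return principal components Z = U*D and loadings V of a data matrix X.

    X is an (n x p) array; its columns are centered before the SVD,
    matching the zero-mean assumption used in the text.
    """
    Xc = X - X.mean(axis=0)                  # center each column
    U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = U * d                                # principal components (n x r)
    V = Vt.T                                 # loadings, one column per PC (p x r)
    return Z, V

# toy usage
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
Z, V = pca_via_svd(X)
print(Z.shape, V.shape)
```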

At the same time, however, PCA has its own drawback: each principal component of $\mathbf{X}$ is a linear combination of all $p$ variables, and the elements of the loading vectors (the columns of $\mathbf{V}$) are typically all nonzero, which makes the resulting PCs difficult to interpret.

Given these desirable properties and this drawback, a natural goal is to keep the dimension reduction property while reducing the number of explanatory variables involved. For this purpose, one simple method is to set to zero every loading whose absolute value is smaller than a threshold.

The same problem also arises in multiple linear regression, where the response variable is explained by a linear combination of the explanatory variables. But which are the important explanatory variables that account for most of the information in the response? To answer this question, researchers have proposed several approaches. The first type comprises lasso-based approaches [3–6]; for example, Efron et al. [7] proposed LARS, which is sensitive to sample noise, in 2004, and Friedman et al. [8] put forward a pathwise coordinate optimization approach in 2007. The second type comprises lasso-bootstrap methods [9, 10]. Other methods include overview-type articles [11–13]; moreover, Radchenko et al. [14] proposed a variable inclusion and shrinkage method, Fan and Li [15] proposed a nonconcave penalized likelihood method, and Candès and Tao [16] proposed the Dantzig selector. Among them, the lasso [5] is a well-known variable selection technique which produces an accurate and sparse model. Later, Zou and Hastie [4] proposed the elastic net in 2005, an adaptive method which has the advantages of both the lasso and ridge regression; the elastic net can be considered a generalization of both. In 2006, Zou et al. [3] proposed sparse principal component analysis (SPCA), using the lasso and ridge regression to produce modified principal components with sparse loadings. They showed that PCA can be formulated as a regression-type optimization problem, and sparse loadings are obtained by imposing $\ell_{1}$-norm and $\ell_{2}$-norm regularization on the regression coefficients of the model. In 2009, Leng and Wang [17] proposed a simple adaptive sparse principal component analysis, which uses the adaptive lasso penalty instead of the lasso penalty in SPCA. Moreover, they replaced the least-squares objective function in SPCA by a general least-squares objective function, which makes it possible to study many related SPCA estimators under a unified theoretical framework and yields a general adaptive sparse principal component analysis. SPCA produces a sparse model with only a few active coefficients, the other coefficients all being set to zero, and is therefore more interpretable than PCA. In a word, SPCA can specifically identify structural information in the data matrix.

The idea of making a model's coefficients sparse is not new. Jolliffe et al. [18] first proposed a sparse PCA method using $\ell_{1}$-norm regularization, which led to a variety of sparsity-inducing methods. The success of SPCA in producing more interpretable models motivates the method proposed in this paper, a fraction function penalty method for sparse PCA. The idea is to exploit the fraction function [19] to penalize the model coefficients in order to obtain sparse coefficients, which gives PCA greater interpretability.

The rest of the paper is organized as follows. In Section 2, the definition and derivation of principal components are reviewed and sparse principal components are presented. In Section 3, we propose the fraction function-penalized sparse principal components. Numerical experiments and conclusions are given in Sections 4 and 5, respectively.

In this paper, unless otherwise specified, a scalar is denoted by a lower case letter (e.g., $x$), a vector by a bold lower case letter (e.g., $\mathbf{x}$), and a matrix by a bold capital letter (e.g., $\mathbf{X}$); $\mathbf{I}$ denotes the identity matrix. The transpose of a real matrix $\mathbf{A}$ is denoted by $\mathbf{A}^{T}$. The Frobenius norm of a column vector or matrix $\mathbf{A}$ is denoted by $\|\mathbf{A}\|_{F}$, and the spectral norm of a matrix is denoted by $\|\cdot\|_{2}$.

2. Principal Components and Sparse Principal Components

2.1. Definition and Derivation of Principal Components

Let $\mathbf{x}$ be a vector of $p$ random variables; the variances of these $p$ random variables and the covariances between them are of central importance. If $p$ is large, a key alternative is to seek a few derived variables that preserve most of the information carried by these variances and covariances. One such technique is principal component analysis (PCA), which is concerned more with the variances of the variables than with their covariances. PCA first looks for a linear function $\boldsymbol{\alpha}_{1}^{T}\mathbf{x}$ of the elements of $\mathbf{x}$ having maximum variance, where $\boldsymbol{\alpha}_{1}$ is a vector of $p$ constants, that is,
$$\boldsymbol{\alpha}_{1}^{T}\mathbf{x} = \alpha_{11}x_{1} + \alpha_{12}x_{2} + \cdots + \alpha_{1p}x_{p}. \qquad (2)$$

Then, in the same way, we can find a linear function $\boldsymbol{\alpha}_{2}^{T}\mathbf{x}$, uncorrelated with $\boldsymbol{\alpha}_{1}^{T}\mathbf{x}$ and having maximum variance, so that at the $k$-th step a linear function $\boldsymbol{\alpha}_{k}^{T}\mathbf{x}$ is found which has maximum variance and is uncorrelated with the former linear functions. The function $\boldsymbol{\alpha}_{k}^{T}\mathbf{x}$ is the $k$-th principal component (PC) [1]. Many of the estimation problems in array signal processing can also start with the same issue as in (2); however, there may exist some perturbations [20–22]. Having defined the PCs, the remaining question is how to find them. Suppose the random vector $\mathbf{x}$ has a known covariance matrix $\boldsymbol{\Sigma}$, whose $(i,j)$-th entry is the covariance between $x_{i}$ and $x_{j}$ when $i \neq j$ and the variance of $x_{i}$ when $i = j$. When $\boldsymbol{\Sigma}$ is unknown, the more realistic approach is to substitute the sample covariance matrix $\mathbf{S}$ for it.

In fact, because $\boldsymbol{\Sigma}$ is often unknown, we consider only the sample case in what follows.

To obtain the PCs of the sample matrix $\mathbf{X} \in \mathbb{R}^{n \times p}$, where $n$ and $p$ are the number of observations and the number of variables of the random vector $\mathbf{x}$, respectively, consider a coefficient vector $\mathbf{v}_{1}$ chosen to maximize the sample variance of $\mathbf{X}\mathbf{v}_{1}$, where $\mathbf{v}_{1}$ is normalized. Let $\mathbf{x}_{j}$ denote the $j$-th column of the sample matrix $\mathbf{X}$, and suppose each $\mathbf{x}_{j}$ is zero-centered. The problem can be reformulated as follows:
$$\max_{\mathbf{v}_{1}} \ \mathbf{v}_{1}^{T}\mathbf{X}^{T}\mathbf{X}\mathbf{v}_{1} \quad \text{subject to} \quad \mathbf{v}_{1}^{T}\mathbf{v}_{1} = 1, \qquad (3)$$

and solving (3), we obtain the relationship $\mathbf{X}^{T}\mathbf{X}\mathbf{v}_{1} = \lambda_{1}\mathbf{v}_{1}$; furthermore, the sample variance of $\mathbf{X}\mathbf{v}_{1}$ equals $\lambda_{1}$, where $\lambda_{1}$ is the largest eigenvalue of the matrix $\mathbf{X}^{T}\mathbf{X}$ and $\mathbf{v}_{1}$ is the eigenvector of $\mathbf{X}^{T}\mathbf{X}$ corresponding to $\lambda_{1}$. $\mathbf{X}\mathbf{v}_{1}$ is called the first PC of the sample matrix $\mathbf{X}$.

In the same way, the $k$-th PC of $\mathbf{X}$ is $\mathbf{X}\mathbf{v}_{k}$, and the sample variance of $\mathbf{X}\mathbf{v}_{k}$ is $\lambda_{k}$, where $\lambda_{k}$ is the $k$-th largest eigenvalue of $\mathbf{X}^{T}\mathbf{X}$ and $\mathbf{v}_{k}$ is the corresponding eigenvector.

In general, suppose the matrix $\mathbf{X}$ is given and each column $\mathbf{x}_{j}$ of $\mathbf{X}$ is zero-centered. Let the singular value decomposition of $\mathbf{X}$ be $\mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}^{T}$. Then, the principal components of $\mathbf{X}$, denoted by $\mathbf{Z}$, are $\mathbf{Z} = \mathbf{X}\mathbf{V}_{r}$, where $\mathbf{V}_{r}$ is the matrix composed of the first $r$ columns of the matrix $\mathbf{V}$ and $r$ is the rank of $\mathbf{X}$. So, we can get $\mathbf{z}_{i} = \mathbf{X}\mathbf{v}_{i}$ [1, 3], where $\mathbf{z}_{i}$ and $\mathbf{v}_{i}$ are the $i$-th columns of the matrices $\mathbf{Z}$ and $\mathbf{V}$, respectively.

Although principal component analysis is widely used in various technical fields, it has an obvious drawback: each principal component is a linear combination of all of the original variables, which often makes the result difficult to interpret. To address this problem, Zou et al. [3] proposed sparse principal component analysis. They first formulate PCA as a linear regression problem.

2.2. Preparation of Sparse Principal Components
2.2.1. Linear Regression and LASSO

Consider a matrix $\mathbf{X} \in \mathbb{R}^{n \times p}$ with $n$ observations of $p$ variables, let $\mathbf{y} \in \mathbb{R}^{n}$ be the response vector, and let $\mathbf{x}_{1}, \ldots, \mathbf{x}_{p}$ be the predictor variables, where $\mathbf{x}_{j}$ is the $j$-th column of $\mathbf{X}$. Assume $\mathbf{y}$ and the $\mathbf{x}_{j}$ have zero means. The question of evaluating the PCA of $\mathbf{X}$ can be related to the linear regression problem
$$\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^{2}. \qquad (5)$$

This model is simple and the solution is easy to obtain in some cases; however, the result is difficult to interpret [3]. Based on model (5), Tibshirani [5] proposed the lasso method:
$$\hat{\boldsymbol{\beta}}_{\text{lasso}} = \arg\min_{\boldsymbol{\beta}} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^{2} + \lambda\|\boldsymbol{\beta}\|_{1}, \qquad (6)$$
where $\|\boldsymbol{\beta}\|_{1} = \sum_{j=1}^{p}|\beta_{j}|$ and $\lambda \geq 0$.
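As a concrete illustration of a lasso fit such as model (6), the sketch below uses scikit-learn; the values and names are illustrative, and scikit-learn scales the squared loss by $1/(2n)$, so its `alpha` is not numerically identical to the $\lambda$ above.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p = 50, 20
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]             # only three active coefficients
y = X @ beta_true + 0.1 * rng.normal(size=n)

# lasso: squared loss plus an l1 penalty on the coefficients
model = Lasso(alpha=0.05, fit_intercept=False).fit(X, y)
print("nonzero coefficients:", np.flatnonzero(model.coef_))
```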

Model (6) can produce an accurate and sparse solution at the same time, which strengthens the interpretability of the result. However, the lasso method has some limitations, the most relevant one being that the number of variables selected by the lasso is limited by the number of observations: when the numbers of rows and columns of $\mathbf{X}$ satisfy $p > n$, the lasso can select at most $n$ explanatory variables.

To overcome the abovementioned drawback, Zou and Hastie [4] proposed the elastic net model:
$$\hat{\boldsymbol{\beta}}_{\text{en}} = \arg\min_{\boldsymbol{\beta}} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^{2} + \lambda_{2}\|\boldsymbol{\beta}\|^{2} + \lambda_{1}\|\boldsymbol{\beta}\|_{1}, \qquad (7)$$
where $\lambda_{1} \geq 0$ and $\lambda_{2} \geq 0$.

The elastic net model shares the properties of the lasso and ridge regression, because when $\lambda_{2} = 0$ the elastic net is the lasso, and when $\lambda_{1} = 0$ it becomes ridge regression. When $p > n$, by choosing a proper $\lambda_{2} > 0$, the elastic net model can potentially include all $p$ variables.
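A minimal scikit-learn sketch of an elastic net fit in the $p > n$ setting; note that scikit-learn parameterizes the penalty through `alpha` and `l1_ratio` rather than through two separate weights, so the mapping to $\lambda_{1}$ and $\lambda_{2}$ above holds only up to rescaling.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(2)
n, p = 30, 100                               # p > n: the lasso alone selects at most n variables
X = rng.normal(size=(n, p))
y = X[:, :5].sum(axis=1) + 0.1 * rng.normal(size=n)

# l1_ratio = 1 recovers the lasso, l1_ratio = 0 recovers ridge regression
enet = ElasticNet(alpha=0.1, l1_ratio=0.5, fit_intercept=False).fit(X, y)
print("number of selected variables:", np.count_nonzero(enet.coef_))
```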

2.2.2. Ridge Regression and the Naive Elastic Net

In models (6) and (7), sparsity of the coefficients is induced entirely by the $\ell_{1}$-norm penalty; the $\ell_{2}$-norm penalty plays no role in producing sparsity. Jolliffe et al. [18] proposed the SCoTLASS method:
$$\max_{\mathbf{v}} \ \mathbf{v}^{T}\mathbf{X}^{T}\mathbf{X}\mathbf{v} \quad \text{subject to} \quad \mathbf{v}^{T}\mathbf{v} = 1, \ \|\mathbf{v}\|_{1} \leq t,$$
where $t$ is a tuning parameter. This method obtains sparse loadings by imposing an $\ell_{1}$-norm constraint on principal component analysis (PCA).

The SCoTLASS method has two drawbacks: one is the difficulty of choosing the parameter $t$, and the other is the high computational cost. Zou et al. [3] therefore considered a modified method. They first showed that principal component analysis (PCA) can be expressed as a ridge regression problem.

Suppose the singular value decomposition of $\mathbf{X}$ is $\mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}^{T}$. Let $\mathbf{u}_{i}$ and $d_{ii}$ be the $i$-th column of the matrix $\mathbf{U}$ and the $i$-th diagonal element of the matrix $\mathbf{D}$, respectively; then, $\mathbf{z}_{i} = \mathbf{u}_{i}d_{ii}$ is the $i$-th principal component of $\mathbf{X}$, $i = 1, \ldots, r$. The ridge estimate of the regression coefficients is given by the following expression:
$$\hat{\boldsymbol{\beta}}_{\text{ridge}} = \arg\min_{\boldsymbol{\beta}} \|\mathbf{z}_{i} - \mathbf{X}\boldsymbol{\beta}\|^{2} + \lambda\|\boldsymbol{\beta}\|^{2}. \qquad (9)$$

Let $\hat{\mathbf{v}} = \hat{\boldsymbol{\beta}}_{\text{ridge}}/\|\hat{\boldsymbol{\beta}}_{\text{ridge}}\|$; then $\hat{\mathbf{v}} = \mathbf{v}_{i}$, where $\mathbf{v}_{i}$ is the $i$-th column of $\mathbf{V}$.
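The NumPy sketch below checks this ridge-regression view of PCA numerically: regressing the first principal component on $\mathbf{X}$ with an $\ell_{2}$ penalty and normalizing the coefficients recovers the first loading vector up to sign. All names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 8))
X -= X.mean(axis=0)

U, d, Vt = np.linalg.svd(X, full_matrices=False)
z1 = U[:, 0] * d[0]                          # first principal component

lam = 10.0                                   # any positive ridge parameter works
beta = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ z1)
v_hat = beta / np.linalg.norm(beta)          # normalized ridge coefficients

# v_hat equals the first loading vector up to sign
print(np.allclose(np.abs(v_hat), np.abs(Vt[0]), atol=1e-8))
```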

This conclusion shows the relationship between principal component analysis and the regression method. The ridge term in (9) guarantees the uniqueness of $\hat{\boldsymbol{\beta}}_{\text{ridge}}$ when the inverse of $\mathbf{X}^{T}\mathbf{X}$ does not exist. Furthermore, Zou and Hastie [4] proposed adding an $\ell_{1}$-norm penalty to model (9), which gives the following optimization problem:
$$\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \|\mathbf{z}_{i} - \mathbf{X}\boldsymbol{\beta}\|^{2} + \lambda\|\boldsymbol{\beta}\|^{2} + \lambda_{1}\|\boldsymbol{\beta}\|_{1}. \qquad (10)$$

They call $\hat{\mathbf{v}} = \hat{\boldsymbol{\beta}}/\|\hat{\boldsymbol{\beta}}\|$ an approximation of $\mathbf{v}_{i}$, and $\mathbf{X}\hat{\mathbf{v}}$ the approximated principal component. To distinguish this model from model (7), model (10) is called the naive elastic net.

2.3. Sparse Principal Components Based on PCA

Based on the general conclusion at the end of Section 2.1, Zou et al. [3] proposed a two-stage analysis: first compute the PCs, and then find suitable sparse approximations of them.

Consider the first $k$ principal components of $\mathbf{X}$, where $k \leq r$. Let $\mathbf{A}_{p \times k} = [\boldsymbol{\alpha}_{1}, \ldots, \boldsymbol{\alpha}_{k}]$, $\mathbf{B}_{p \times k} = [\boldsymbol{\beta}_{1}, \ldots, \boldsymbol{\beta}_{k}]$, and
$$(\hat{\mathbf{A}}, \hat{\mathbf{B}}) = \arg\min_{\mathbf{A},\mathbf{B}} \|\mathbf{X} - \mathbf{X}\mathbf{B}\mathbf{A}^{T}\|_{F}^{2} + \lambda\sum_{j=1}^{k}\|\boldsymbol{\beta}_{j}\|^{2} \quad \text{subject to} \quad \mathbf{A}^{T}\mathbf{A} = \mathbf{I}_{k},$$
for any $\lambda > 0$. Then, $\hat{\boldsymbol{\beta}}_{j} \propto \mathbf{v}_{j}$, where $\hat{\boldsymbol{\beta}}_{j}$ is the estimate of $\boldsymbol{\beta}_{j}$, $j = 1, \ldots, k$.

Then, we can perform SPCA, based on the PCA and regression method, by adding an $\ell_{1}$-norm penalty to the regression coefficients to induce sparse loadings. Thus, the following criterion is obtained:
$$(\hat{\mathbf{A}}, \hat{\mathbf{B}}) = \arg\min_{\mathbf{A},\mathbf{B}} \|\mathbf{X} - \mathbf{X}\mathbf{B}\mathbf{A}^{T}\|_{F}^{2} + \lambda\sum_{j=1}^{k}\|\boldsymbol{\beta}_{j}\|^{2} + \sum_{j=1}^{k}\lambda_{1,j}\|\boldsymbol{\beta}_{j}\|_{1}, \qquad (12)$$
subject to $\mathbf{A}^{T}\mathbf{A} = \mathbf{I}_{k}$, where $\boldsymbol{\alpha}_{j}$ is the $j$-th column of the matrix $\mathbf{A}$ and $\boldsymbol{\beta}_{j}$ is the $j$-th column of the matrix $\mathbf{B}$, $j = 1, \ldots, k$.

3. Fractional Function-Penalized Sparse Principal Components

3.1. Fraction Function-Penalized Method

Recently, Li et al. [19] proposed a fraction function penalty. They studied the nonconvex sparsity-promoting penalty function
$$P_{a}(\mathbf{x}) = \sum_{i}\rho_{a}(x_{i}), \qquad \rho_{a}(t) = \frac{a|t|}{1 + a|t|}, \qquad (13)$$
used to relax the $\ell_{0}$-norm in compressed sensing [15, 16, 23, 24] in order to recover sparse signals, where $a > 0$ is a parameter. The fraction function $\rho_{a}$ is nonconvex and sparsity promoting. They studied the properties of this fraction function, derived a closed-form expression for the optimal solution of the fraction function penalty problem, and proposed an iterative thresholding algorithm to solve it. A series of experiments showed that, compared with the soft thresholding and half thresholding algorithms, the proposed algorithm recovers sparse signals better both with and without measurement noise.
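A small numeric illustration of the fraction function (a sketch; the helper name is ours): for a fixed nonzero component, $\rho_{a}$ increases toward 1 as $a$ grows, so $P_{a}(\mathbf{x})$ approaches the number of nonzero entries of $\mathbf{x}$, while for small $a$ it behaves like a scaled $\ell_{1}$ penalty.

```python
import numpy as np

def frac_penalty(x, a):
    """Fraction function penalty P_a(x) = sum_i a|x_i| / (1 + a|x_i|)."""
    ax = a * np.abs(x)
    return np.sum(ax / (1.0 + ax))

x = np.array([0.0, 0.5, -2.0, 0.0, 1.0])
print("number of nonzeros:", np.count_nonzero(x))
for a in (0.1, 1.0, 10.0, 1000.0):
    # as a grows, the penalty approaches the number of nonzero entries
    print(f"a = {a:7.1f}  P_a(x) = {frac_penalty(x, a):.4f}")
```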

Based on the good performance of the fraction function in promoting sparsity in compressed sensing [23, 25], in this paper we adapt this penalty to sparse principal component analysis (SPCA), i.e., we replace the $\ell_{1}$-norm penalty in SPCA with the fraction function penalty.

Combining (12) and (13), we get
$$(\hat{\mathbf{A}}, \hat{\mathbf{B}}) = \arg\min_{\mathbf{A},\mathbf{B}} \|\mathbf{X} - \mathbf{X}\mathbf{B}\mathbf{A}^{T}\|_{F}^{2} + \lambda\sum_{j=1}^{k}\|\boldsymbol{\beta}_{j}\|^{2} + \sum_{j=1}^{k}\lambda_{1,j}P_{a}(\boldsymbol{\beta}_{j}), \quad \text{subject to} \quad \mathbf{A}^{T}\mathbf{A} = \mathbf{I}_{k}. \qquad (14)$$

We now introduce the following fact for the purpose of solving the problem (14).

Since $\mathbf{A}^{T}\mathbf{A} = \mathbf{I}_{k}$, let $\mathbf{A}_{\perp}$ be a $p \times (p-k)$ matrix such that $[\mathbf{A}; \mathbf{A}_{\perp}]$ is a $p \times p$ orthonormal matrix. Then,
$$\|\mathbf{X} - \mathbf{X}\mathbf{B}\mathbf{A}^{T}\|_{F}^{2} = \|\mathbf{X}\mathbf{A}_{\perp}\|_{F}^{2} + \|\mathbf{X}\mathbf{A} - \mathbf{X}\mathbf{B}\|_{F}^{2}.$$

So, problem (14) can be transformed into
$$\min_{\mathbf{A},\mathbf{B}} \|\mathbf{X}\mathbf{A}_{\perp}\|_{F}^{2} + \|\mathbf{X}\mathbf{A} - \mathbf{X}\mathbf{B}\|_{F}^{2} + \lambda\sum_{j=1}^{k}\|\boldsymbol{\beta}_{j}\|^{2} + \sum_{j=1}^{k}\lambda_{1,j}P_{a}(\boldsymbol{\beta}_{j}), \quad \text{subject to} \quad \mathbf{A}^{T}\mathbf{A} = \mathbf{I}_{k}. \qquad (19)$$

Problem (19) can be solved by an alternating algorithm. If $\mathbf{A}$ is given, $\mathbf{B}$ can be estimated by
$$\hat{\mathbf{B}} = \arg\min_{\mathbf{B}} \|\mathbf{X}\mathbf{A} - \mathbf{X}\mathbf{B}\|_{F}^{2} + \lambda\sum_{j=1}^{k}\|\boldsymbol{\beta}_{j}\|^{2} + \sum_{j=1}^{k}\lambda_{1,j}P_{a}(\boldsymbol{\beta}_{j}), \qquad (20)$$
and if $\mathbf{B}$ is given, problem (14) reduces to
$$\hat{\mathbf{A}} = \arg\min_{\mathbf{A}} \|\mathbf{X} - \mathbf{X}\mathbf{B}\mathbf{A}^{T}\|_{F}^{2} \quad \text{subject to} \quad \mathbf{A}^{T}\mathbf{A} = \mathbf{I}_{k}, \qquad (21)$$
where $\boldsymbol{\beta}_{j}$ is the $j$-th column of the matrix $\mathbf{B}$. According to the Procrustes rotation [26–28], the estimate of $\mathbf{A}$ can be obtained: if the SVD of the matrix $\mathbf{X}^{T}\mathbf{X}\mathbf{B}$ is $\mathbf{X}^{T}\mathbf{X}\mathbf{B} = \mathbf{U}\mathbf{D}\mathbf{V}^{T}$, then, by [3], $\hat{\mathbf{A}} = \mathbf{U}\mathbf{V}^{T}$.
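The A-update admits the closed form described above. A minimal NumPy sketch of this reduced-rank Procrustes step (the helper name is ours):

```python
import numpy as np

def procrustes_update(X, B):
    """Given B, return the orthonormal A (A^T A = I_k) minimizing ||X - X B A^T||_F^2.

    Following Zou et al. [3]: if X^T X B = U D V^T is a thin SVD,
    the minimizer is A = U V^T.
    """
    M = X.T @ X @ B
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

# toy check of the orthonormality constraint
rng = np.random.default_rng(4)
X = rng.normal(size=(50, 6))
B = rng.normal(size=(6, 2))
A = procrustes_update(X, B)
print(np.allclose(A.T @ A, np.eye(2), atol=1e-10))
```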

3.2. Transformation of Problem (16)

In (20), for $j = 1, \ldots, k$, let $\mathbf{y}_{j} = \mathbf{X}\boldsymbol{\alpha}_{j}$, where $\boldsymbol{\alpha}_{j}$ is the $j$-th column of the matrix $\mathbf{A}$. Then (20) separates across the columns of $\mathbf{B}$ and can be reformulated as
$$\hat{\boldsymbol{\beta}}_{j} = \arg\min_{\boldsymbol{\beta}_{j}} \|\mathbf{y}_{j} - \mathbf{X}\boldsymbol{\beta}_{j}\|^{2} + \lambda\|\boldsymbol{\beta}_{j}\|^{2} + \lambda_{1,j}P_{a}(\boldsymbol{\beta}_{j}). \qquad (22)$$

Let and . Then, (22) can be rewritten as

The relationship of and can be explained as the following. Let

According to the expressions of and the definition of , we can get

Owing to , it is easy to obtain that , so thatthat is,

Both and are continuous functions with respect to and , respectively, so as , we have , and further, we get . That is, as .

So, we conclude that the solution of (22) can be approximated well by that of (24) when the positive parameter is small enough.

Remark 1. The reason for this substitution is that model (24) is easier to solve than model (22).
Now, we will show that the optimal solution to the minimization problem (29) can be expressed as a thresholding operation.

3.3. The Algorithm for Problem (29)

For any , and , let

We have the following results.

Lemma 1. For any fixed parameter and , if is a local minimizer to , thenwhere ,and according to [19],

Proof. We can see that the objective can be expressed in a separable form; that is to say, minimizing it with respect to a single component, with the remaining quantities given, is equivalent to a one-dimensional problem. So, a point is a local minimizer if and only if each of its components solves the corresponding one-dimensional problem. According to Lemma 10 in [19], we complete the proof.
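To illustrate the componentwise thresholding behind Lemma 1, the sketch below solves the scalar subproblem $\min_{b}\ \tfrac{1}{2}(b - t)^{2} + \lambda\, a|b|/(1 + a|b|)$ numerically by examining $b = 0$ and the real stationary points of the smooth branch. Reference [19] gives a closed-form expression for this operator; the code is only a numerical stand-in with our own helper name, and the $\tfrac{1}{2}$ scaling of the quadratic term is an assumption.

```python
import numpy as np

def fp_threshold(t, lam, a):
    """Numerically solve min_b 0.5*(b - t)^2 + lam * a|b| / (1 + a|b|).

    A numerical stand-in for the closed-form thresholding operator of [19]:
    the minimizer is either 0 or a nonnegative stationary point of the
    smooth branch, taken with the sign of t.
    """
    s, tau = np.sign(t), abs(t)
    # stationary points of the smooth branch solve a cubic in b >= 0:
    # (b - tau) * (1 + a*b)^2 + lam*a = 0
    coeffs = [a**2, 2*a - tau*a**2, 1 - 2*a*tau, lam*a - tau]
    roots = np.roots(coeffs)
    cands = [0.0] + [r.real for r in roots if abs(r.imag) < 1e-10 and r.real >= 0]
    obj = lambda b: 0.5 * (b - tau) ** 2 + lam * a * b / (1 + a * b)
    return s * min(cands, key=obj)

# small inputs are set exactly to zero, large inputs are shrunk only mildly
for t in (0.05, 0.3, 2.0):
    print(t, "->", round(fp_threshold(t, lam=0.1, a=3.0), 4))
```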
Furthermore, we can get Theorem 1.

Theorem 1. If is an optimal solution to (29), and is positive value and satisfies , and then, the optimal solution can be given bywhere ,

Proof. According to , we haveThat is to say, for , is a local minimizer of as long as is a solution to (29). According to Lemma 10 in [19] and Lemma 1, we complete the proof.
With the expression given in Theorem 1, we can obtain an alternating thresholding algorithm for problem (29), namely, the iteration (40), where the thresholding operator is the one defined in Lemma 1. We call this algorithm the alternating iterative FP-SPCA thresholding algorithm.

Remark 2. The argument of the thresholding operator may be a real number, as in Lemma 1, or a vector whose components are all real numbers. If the argument is a vector, the operator acts componentwise; the result is the vector whose components are obtained by applying the operator to each component of the input in turn.
It is well known that the solution to problem (29) depends on the choice of the regularization parameter. Before giving a proper choice for this parameter, we need the following definition.

Definition 1. The nonincreasing rearrangement of a vector $\mathbf{x} \in \mathbb{R}^{n}$ is the vector $\mathbf{x}^{\downarrow}$ for which $x^{\downarrow}_{1} \geq x^{\downarrow}_{2} \geq \cdots \geq x^{\downarrow}_{n} \geq 0$ and there is a permutation $\pi$ with $x^{\downarrow}_{i} = |x_{\pi(i)}|$ for all $i$, where $x^{\downarrow}_{i}$ is the $i$-th component of the vector $\mathbf{x}^{\downarrow}$.
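A one-line NumPy illustration of Definition 1 (variable names are ours): sorting the absolute values in nonincreasing order and recording the permutation that achieves it.

```python
import numpy as np

x = np.array([0.3, -2.0, 0.0, 1.5])
perm = np.argsort(-np.abs(x))           # permutation realizing the rearrangement
x_dec = np.abs(x)[perm]                 # nonincreasing rearrangement of |x|
print(perm, x_dec)                      # [1 3 0 2] [2.  1.5 0.3 0. ]
```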
Similar to the choice of the regularization parameter in the FP algorithm (see Scheme 2 in [19]), here we can approximate the optimal solution by the value at the current step, and the regularization parameter can then be chosen adaptively in terms of a small positive number $\epsilon$, such as 0.1, 0.01, or 0.001, and the prescribed sparsity degree of the solution. In this way, the iterative algorithm is adaptive and free from manual choice of the regularization parameter. Notice that the parameter appearing in part (ii) of Theorem 2 is taken accordingly.
With the abovementioned parameter-choosing method and the iteration (40), we obtain the following alternating iterative FP-SPCA thresholding algorithm. If $\mathbf{A}$ is given, then for each $j$ the subproblem is the same as in formula (24), and the estimate of $\boldsymbol{\beta}_{j}$ is obtained by the iterative formula (40). If $\mathbf{B}$ is fixed, then, according to (21), we only need to minimize $\|\mathbf{X} - \mathbf{X}\mathbf{B}\mathbf{A}^{T}\|_{F}^{2}$ with respect to $\mathbf{A}$, subject to $\mathbf{A}^{T}\mathbf{A} = \mathbf{I}_{k}$. According to the reduced rank form of the Procrustes rotation in [26], if the singular value decomposition of the matrix $\mathbf{X}^{T}\mathbf{X}\mathbf{B}$ is $\mathbf{U}\mathbf{D}\mathbf{V}^{T}$, the solution is $\hat{\mathbf{A}} = \mathbf{U}\mathbf{V}^{T}$ [3]. The alternating iterative FP-SPCA thresholding algorithm is described in Algorithm 1.
Finally, we discuss the convergence of the iterative FP-SPCA thresholding algorithm; it converges to a stationary point of the iteration (40).

Initialize: Choose , , , , , , , and .
while not converged do
 for
  
  ;
  if then
   ;
  else
   ;
  end
  ;
  for
  if
   
  else
   
  end
 end
 end;
end while
for
end
output:.
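A compact Python sketch of the alternating scheme behind Algorithm 1, under our reading of it: the B-step applies componentwise fraction-penalty thresholding (computed numerically here rather than by the closed form of [19]), and the A-step is the reduced-rank Procrustes rotation. The penalty weight, step size, iteration counts, and the omission of the ridge term are simplified placeholders, not the authors' exact settings.

```python
import numpy as np

def fp_threshold(t, lam, a):
    """Componentwise solve min_b 0.5*(b - t)^2 + lam*a|b|/(1 + a|b|) (numerical stand-in)."""
    s, tau = np.sign(t), abs(t)
    roots = np.roots([a**2, 2*a - tau*a**2, 1 - 2*a*tau, lam*a - tau])
    cands = [0.0] + [r.real for r in roots if abs(r.imag) < 1e-10 and r.real >= 0]
    obj = lambda b: 0.5 * (b - tau) ** 2 + lam * a * b / (1 + a * b)
    return s * min(cands, key=obj)

def fp_spca(X, k=2, a=3.0, lam=0.05, n_outer=50, n_inner=100):
    """Alternating sketch: thresholded gradient steps for B, Procrustes rotation for A.

    The ridge term of the criterion is omitted for brevity, and lam is a fixed
    illustrative penalty weight rather than the adaptive choice of the paper.
    """
    X = X - X.mean(axis=0)
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    A = Vt[:k].T                                 # initialize A with ordinary loadings
    B = A.copy()
    mu = 1.0 / (d[0] ** 2)                       # step size from the spectral norm of X
    for _ in range(n_outer):
        for j in range(k):                       # B-step: iterative FP thresholding
            y, beta = X @ A[:, j], B[:, j]
            for _ in range(n_inner):
                z = beta + mu * (X.T @ (y - X @ beta))
                beta = np.array([fp_threshold(t, lam * mu, a) for t in z])
            B[:, j] = beta
        M = X.T @ X @ B                          # A-step: reduced-rank Procrustes
        Uo, _, Vo = np.linalg.svd(M, full_matrices=False)
        A = Uo @ Vo
    norms = np.linalg.norm(B, axis=0)
    return B / np.where(norms > 0, norms, 1.0)   # normalized sparse loadings

rng = np.random.default_rng(5)
print(fp_spca(rng.normal(size=(100, 10))).shape)
```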

Theorem 2. Let $\{\boldsymbol{\beta}^{n}\}$ be the sequence generated by the iterative FP-SPCA thresholding algorithm (40) with a properly chosen step-size parameter. Then, (i) the sequence of objective values is monotonically decreasing; (ii) $\{\boldsymbol{\beta}^{n}\}$ is asymptotically regular, i.e., $\lim_{n \to \infty}\|\boldsymbol{\beta}^{n+1} - \boldsymbol{\beta}^{n}\| = 0$; and (iii) $\{\boldsymbol{\beta}^{n}\}$ converges to a stationary point of the iteration (40).

Proof. The proof is similar to the proof of Theorem 4.1 in [29] and Lemma 2 in [30], so it is omitted.

4. Numerical Experiments

In this section, we present numerical results of the FP-SPCA method and compare them with the SPCA method [3]. The simulations are conducted on a personal computer (Intel(R) Core(TM) i7-6700 CPU @ 3.40 GHz, 16.0 GB RAM) using the MATLAB R2014a programming platform.

4.1. Experimental Preparation

In order to compare with the SPCA method, we use the same synthetic data as in [3]. The synthetic data have three hidden factors:
$$V_{1} \sim N(0, 290), \quad V_{2} \sim N(0, 300), \quad V_{3} = -0.3V_{1} + 0.925V_{2} + \epsilon, \quad \epsilon \sim N(0, 1),$$
where $V_{1}$, $V_{2}$, and $\epsilon$ are independent.

The 10 observable variables are generated as follows:
$$X_{i} = V_{1} + \epsilon_{i}^{1},\ i = 1, \ldots, 4; \quad X_{i} = V_{2} + \epsilon_{i}^{2},\ i = 5, \ldots, 8; \quad X_{i} = V_{3} + \epsilon_{i}^{3},\ i = 9, 10,$$
where the noise terms $\epsilon_{i}^{j} \sim N(0, 1)$ are independent of each other and of the hidden factors. We use the resulting data matrix $\mathbf{X}$ to perform PCA, SPCA, and FP-SPCA.
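A Python sketch of this data-generating process (our implementation of the recipe above; the sample size n is an illustrative choice, not specified by the construction):

```python
import numpy as np

def generate_synthetic(n=500, seed=0):
    """Synthetic data with three hidden factors, following the recipe of Zou et al. [3]."""
    rng = np.random.default_rng(seed)
    v1 = rng.normal(0, np.sqrt(290), n)
    v2 = rng.normal(0, np.sqrt(300), n)
    v3 = -0.3 * v1 + 0.925 * v2 + rng.normal(0, 1, n)    # Var(V3) = 283.7875
    X = np.empty((n, 10))
    X[:, 0:4] = v1[:, None] + rng.normal(0, 1, (n, 4))   # X1..X4 load on V1
    X[:, 4:8] = v2[:, None] + rng.normal(0, 1, (n, 4))   # X5..X8 load on V2
    X[:, 8:10] = v3[:, None] + rng.normal(0, 1, (n, 2))  # X9, X10 load on V3
    return X

X = generate_synthetic()
print(X.shape)
```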

The three hidden factors have variances 290, 300, and 283.7875, respectively. The numbers of variables associated with the three factors are 4, 4, and 2, so $V_{1}$ and $V_{2}$ are nearly equally important, and both are more important than $V_{3}$. These facts suggest that only the relatively important variables need to be considered, with proper sparse representations: the two derived sparse components should recover the factors $V_{1}$ and $V_{2}$ well, using the variables $(X_{1}, \ldots, X_{4})$ and $(X_{5}, \ldots, X_{8})$, respectively.

The SPCA method was implemented as proposed by Zou et al. [3], and the FP-SPCA method uses the fraction penalty proposed in [19]. We compared the performance of the two methods on the synthetic data matrix and summarize the results in the two cases below.

4.2. Parameter Selection and the Result of Experiments

Case 1. In this situation, the method proposed by Zou et al. [3], called SPCA, and our proposed method, called FP-SPCA, perform comparably with respect to the almost equally important hidden factors. That is, for most of the generated data matrices, SPCA and FP-SPCA have the same loadings but different adjusted variances.
Table 1 reports a numerical result of PCA, SPCA, and FP-SPCA, summarizing the loadings and the adjusted variances; the penalty parameters take fixed values in SPCA and in FP-SPCA. We can see from Table 1 that SPCA and FP-SPCA have the same orthogonal loadings for PC1 and PC2 after normalization, but the adjusted variance obtained by FP-SPCA is larger than that obtained by SPCA. That is to say, PC1 and PC2 obtained by the FP-SPCA method carry more information than PC1 and PC2 obtained by SPCA. Numerical simulations using other generated data matrices give the same result.

Case 2. For some of the generated matrices, according to our experimental results, FP-SPCA works but SPCA does not work well. That is, by adjusting the penalty parameters in the SPCA method, we cannot obtain sparse loadings for PC1 or PC2. The parameters and the corresponding numerical results are reported in Table 2, where the capital letter N indicates that no sparse loadings of PC1 and PC2 were obtained for the parameter value above the N and the value(s) to its left, and NaN (not a number) is used in the same way. We can also see from Table 2 that the loadings of PC1 or PC2 are meaningless when the penalty parameter is between 3.5 and 4.0, and numerical simulations show the same meaningless result when it is increased further. Table 3 gives the loadings of PC1 and PC2 obtained by the SPCA method and by the FP-SPCA method. In a word, the FP-SPCA method can find sparse loadings of PC1 and PC2, while for the SPCA method it is at least difficult to obtain a sparse result in this case.
From Case 1 and Case 2, it can be seen that the FP-SPCA method is more adaptable than the SPCA method. The reason FP-SPCA is preferable is that the parameter $a$ in its penalty function makes it more flexible and easier to adjust in order to obtain the desired result.

5. Conclusions

In this paper, the FP-SPCA method is proposed based on the SPCA method [3, 4], with a fraction penalty function [19] replacing the $\ell_{1}$-norm penalty in SPCA. Numerical simulations show that the proposed FP-SPCA method is more adaptable and flexible than the SPCA method. However, FP-SPCA has its limitations: during the numerical simulations, we found a few generated matrices for which FP-SPCA fails to produce sparse loadings of the principal components and the SPCA method does not work either. How to deal with this problem, and how to compare the proposed method with other existing penalty methods, are questions worth investigating further.

Data Availability

The data used to support the findings of this study are available from the corresponding author on request.

Conflicts of Interest

The authors declare that there are no known conflicts of interest associated with this publication.

Acknowledgments

This work was supported by the National Natural Science Foundation of China under Grant nos. 11771347, 11871392, 91730306, and 41390454.