Complex Methods Applied to Data Analysis, Processing, and VisualizationView this Special Issue
Two-Phase Incremental Kernel PCA for Learning Massive or Online Datasets
As a powerful nonlinear feature extractor, kernel principal component analysis (KPCA) has been widely adopted in many machine learning applications. However, KPCA is usually performed in a batch mode, leading to some potential problems when handling massive or online datasets. To overcome this drawback of KPCA, in this paper, we propose a two-phase incremental KPCA (TP-IKPCA) algorithm which can incorporate data into KPCA in an incremental fashion. In the first phase, an incremental algorithm is developed to explicitly express the data in the kernel space. In the second phase, we extend an incremental principal component analysis (IPCA) to estimate the kernel principal components. Extensive experimental results on both synthesized and real datasets showed that the proposed TP-IKPCA produces similar principal components as conventional batch-based KPCA but is computationally faster than KPCA and its several incremental variants. Therefore, our algorithm can be applied to massive or online datasets where the batch method is not available.
As a conventional linear subspace analysis method, principal component analysis (PCA) can only produce linear subspace feature extractors , which are unsuitable for highly complex and nonlinear data distributions. In contrast, as a nonlinear extension of PCA, kernel principal component analysis (KPCA)  can capture the higher-order statistical information contained in data, thus producing nonlinear subspaces for better feature extraction performance. This has propelled the use of KPCA in a wide range of applications such as pattern recognition, statistical analysis, image processing, and so on [3–8]. Basically, KPCA firstly projects all samples from the input space into a kernel space using nonlinear mapping and then extracts the principal components (PCs) in the kernel space. In practice, such nonlinear mapping is performed implicitly via the “kernel trick”, where an appropriately chosen kernel function is used to evaluate the dot products of mapped samples without having to explicitly carry out the mapping. As a result, the extracted kernel principal component (KPC) of the mapped data is nonlinear with respect to the original input space.
Standard KPCA has some drawbacks which limit its practical applications when handling big or online datasets. Firstly, in the training stage, KPCA needs to store and compute the eigenvectors of a kernel matrix, where is the total number of samples. This computation results in a space complexity of and a time complexity of , thus rendering the evaluation of KPCA on large-scale datasets very time-consuming. Secondly, in the testing stage, the resulting kernel principal components have to be defined implicitly by the linear expression of the training data, and thus all the training data must be saved after training. For a massive dataset, this translates into high costs for storage resources and increases the computational burden during the utilization of kernel principal components (KPCs). Furthermore, KPCA is impractical for many real-world applications where online samples are progressively collected since it is used in a batch manner. This implies that each time new data arrive, KPCA has to be conducted from scratch.
To overcome these limitations, many promising methods have been proposed in the past few years. These methods can be grouped into two classes. The first class is the batch-based modeling method, which requires that all training data is available for estimating KPCs. Rosipal and Girolami proposed an EM algorithm for reducing the computational cost of KPCA . However, the convergence behavior of the EM algorithm to KPCA cannot be guaranteed in theory. In , the kernel Hebbian algorithm (KHA) was proposed as an iterative variant of KPCA algorithm. By kernelizing the generalized Hebbian algorithm (GHA), KHA computes KPCA without storing the kernel matrix, such that large-scale datasets with high dimensionality can be processed. Nonetheless, KHA has a scalar gain parameter which is either held constant or decreased according to a predetermined annealing schedule, leading to slow convergence during the training stage. To improve the convergence of KHA, gain adaptation methods were developed by providing a separate gain for each eigenvector estimate . An improved version of KPCA was proposed based on the eigenvalue decomposition of a symmetric matrix , where datasets are divided into multiple subsets, each of which is processed separately. One of the major drawbacks of this approach is that it requires storing the kernel matrix, which means that the space complexity could be extremely large for large-scale dataset. Another variant of conventional KPCA is greedy KPCA [12, 13], which was employed to approximate the KPCs by a prior filtering of the training data. However, prior filtering of the training data could be computationally expensive. Overall, compared with standard KPCA, these batch-based modeling methods can potentially reduce the time or space complexity to some degree. Unfortunately, such methods cannot handle online data.
The second class is incremental methods, which can compute KPCs incrementally to handle online data processing. Chin and Suter proposed an incremental version of KPCA [14, 15], which is called IKPCA-RS for the notational simplicity. In IKPCA-RS, singular value decomposition is used to update an eigenfeature space incrementally for incoming data. However, IKPCA-RS may lead to high time complexity especially when dealing with high-dimensional data. In [16, 17], an incremental KPCA was presented based on the empirical kernel map. It is more efficient in memory requirement than the standard KPCA. However, it is only an approximate method and only suitable for polynomial kernel function. Inspired by the incremental PCA algorithm proposed by Hall et al. , Kimura et al. presented an incremental KPCA algorithm  in which an incremental updating algorithm for eigenaxes is derived based on a set of linearly independent data. Subsequently, some modified versions are proposed by Ozawa and Takeuchi et al. [20, 21]. Furthermore, in order to incrementally deal with data streams which are given in a chunk of multiple samples at one time, other extensions of KPCA were also successively presented [22–24]. Hallgren and Northrop  proposed an incremental KPCA (INKPCA) by applying rank one updates to the eigendecomposition of kernel matrix. However, INKPCA needs to store the whole data when evaluated on a new sample. Notably, incremental methods have the capacity of integrating new data, initially unavailable, in some way that maintains nonincreasing memory. However, to the best of our knowledge, most of these methods operate in the kernel space where all the samples are implicitly represented. This has two key limitations. First, a number of incremental methods may suffer from high computational cost. Second, the others can only capture the approximate KPCs rather than the accurate ones, which may affect the accuracy of its subsequent process.
Before continuing, a note on mathematical notations is given as follows. We use lower case and upper case letters (e.g., ) to denote scalars, lower case letters with the subscript (e.g., ) to denote an element from a matrix or a vector, lower case bold letters (e.g., ) to denote vectors, and upper case bold letters (e.g., ) to denote matrices. We use () to denote the transpose of a vector (matrix) and to denote the L2-norm of a vector. Furthermore, we adopt to denote a set, to denote a matrix with column vectors and to denote a matrix composed of the corresponding element . In this paper, always denotes a column vector and the inner product between and is expressed as . The lower case bold letter denotes a nonlinear mapping. The mapped sample of the input sample is a column vector.
To address these limitations, we propose a two-phase incremental KPCA (TP-IKPCA), where the mapped data is represented in an explicit form and KPCs are updated in an explicit space. The computational cost of the whole process is very low and the accuracy of KPCs can be theoretically guaranteed. An overview of TP-IKPCA is briefly illustrated in Figure 1. In this figure, denotes the sample set in a d-dimensional input space and denotes the total number of available samples. Let denote the nonlinear mapping which maps the sample set into an h-dimensional implicit kernel space, resulting in the mapped sample set . Here, h may be very large or even infinite, depending on the specific mapping. The TP-IKPCA includes two phases. In the first phase, we develop an incremental algorithm to capture standard orthogonal basis of the subspace spanned by and then explicitly obtain the projection vectors of by where denotes the number of a standard orthogonal basis . In the second phase, the existing incremental method of PCA is employed to capture KPCs based on the explicit data in the projection space. In the following sections, we will detail how to incrementally express the implicit mapped data using an explicit form. We will also theoretically verify that performing PCA based on the implicitly mapped samples is equivalent to that of based on the explicit projection vectors .
Here, we should clarify the relationship among some important quantities, including d, h, r (see Figure 1). In the case of KPCA or TP-IKPCA, the sample set in a d-dimensional input space is firstly mapped into an h-dimensional kernel space by a nonlinear mapping and then a linear PCA is performed based on the mapped set . Usually, h may be very large or even infinite, depending on the specific mapping . So, we usually have , which implies that the dimension of each mapped sample is larger than its original dimension . In the case of TP-IKPCA, denote an orthonormal basis of the subspace spanned by , and is the corresponding projection vectors of the mapped set on the basis , which means that is the dimension of the subspace spanned by and equals the dimension of each projected vector . Generally, since the mapped data have strong linear correlation in the h-dimensional kernel space, we have , which means that a few components generally suffice to capture the nonlinear distribution of the data. Furthermore, if the dimension of the subspace spanned by is high, r may be larger than the dimension of the input space. However, if the mapped data have strong linear correlation, which means may be low and the dimension of the input space is very high, we may get the contrast conclusion.
The main contributions of our work are fourfold: Presenting an algorithm to express the mapped data in an explicit form. This will help for better understanding the distribution of the mapped data in the implicit kernel space. Endowing KPCA with the capacity of handling dynamic dataset. Compared to the standard KPCA, the computational complexity of TP-IKPCA is reduced from to and the storage complexity from to , where denotes the number of training samples and is the number of bases of the subspace spanned by nonlinear mapped samples. Usually the assumption that is valid, which makes TP-IKPCA very convenient for processing large-scale datasets . In the testing stage, the feature extraction from one sample is faster than that of the batch KPCA, since TP-IKPCA only needs to calculate the kernel functions between the new sample and selected training samples which forms the orthonormal basis.
The rest of the paper is organized as follows. Section 2 briefly introduces KPCA. In Section 3, we provide a theoretical analysis of the proposed TP-IKPCA method and elucidate the concrete steps for incrementally capturing KPCs based on the projection vectors in an explicit space. The effectiveness of TP-IKPCA is demonstrated in Section 4. Finally, the conclusions of our study are given in Section 5.
2. Kernel Principal Component Analysis (KPCA)
In this section, we briefly outline the standard procedure of KPCA. As mentioned above, in KPCA, the input sample set is mapped into a kernel space by a nonlinear mapping and then a linear PCA is performed based on in the kernel space.
To obtain the eigenvectors in the kernel space, the covariance matrix is defined aswhere . However, the eigendecomposition of is hindered by the fact that the mapping function is implicit. To avoid the explicit calculation in the kernel space, a kernel matrix is defined, whose elements are determined by the virtue of the following kernel trick:where is a kernel function that allows us to compute inner products in the kernel space .
Combining (2) and (3), Schölkopf et al.  derived the equivalent eigenvalue problem as follows:where denotes the column vector such that the orthogonal eigenvector of the covariance matrix satisfies
Considering need to be normalized, need to satisfy
For a test sample , the projection of on the l-th nonlinear principal component can be computed bywhere is the ith element of ; in other words, .
Of note, for KPCA, the kernel matrix needs to be predefined before performing eigendecompositions. Since the size of scales with , a large memory space is required for a massive dataset. Additionally, the eigendecomposition of involves a time complexity of . This can severely handicap the computation of KPCA on large datasets. In online processing applications, the arrival of a new sample requires adding a new row and a new column in , and eigendecomposition has to be constantly reevaluated for an ever-growing kernel matrix to update the kernel subspaces. Hence, the batch KPCA is not convenient for such applications.
3. Explicit Representation of the Mapped Data
At present, there have been many incremental algorithms for PCA [27–34]. However, it is difficult to directly extend them to KPCA because all mapped samples are expressed implicitly in the kernel space. Obviously, once can be expressed using an explicit form, it will be straightforward to extend incremental PCA to KPCA. In fact, the geometrical structure of can be captured by using a standard orthogonal basis of the subspace spanned by all the samples . Hence, we aim to explicitly express using an indirect way. This motivation comes from the following property shown in Theorem 1.
Theorem 1. If linear PCA is performed based on and , respectively, then their covariance matrices have the same nonzero eigenvalues, and those corresponding eigenvectors satisfy the following relationship:where is the l-th eigenvector of the covariance matrix of and is the l-th eigenvector of the covariance matrix of . The proof is given in Appendix A.
Based on Theorem 1, linear PCA based on can be converted into a linear PCA based on . So, if we can write in an explicit form, then it will be easy to further extend KPCA using existing linear incremental algorithms. To incrementally obtain the orthonormal basis and the projection vectors , we firstly introduce two correlative lemmas.
Lemma 2. Let denote a basis of the subspace spanned by , then the orthonormal basis can be determined usingwhere = . is the eigenvector of the kernel matrix = , scilicet = . is the corresponding eigenvalue of . The proof is given in Appendix B.
Based on Lemma 2, for any mapped sample , we can explicitly define its projection vector under the orthonormal basis aswhere . Obviously, using the kernel function can complete the computation of .
However, the orthogonalization process using Lemma 2 is a batch-based method. Subsequently, when samples are added one by one, its computational cost is still very expensive. So, inspired by the Gram-Schmidt orthogonalization process , we designed an online algorithm for incrementally finding the orthonormal basis and the projection vectors.
Lemma 3. Let denote a basis of the subspace spanned by and is the orthonormal basis obtained by (9). Suppose denotes a new sample we have just included into our dataset. We derive the following properties.
(1) If = = , then is the orthonormal basis of the subspace spanned by and the projection vector of can be computed using (10). Here, = and = .
(2) If = , then the orthonormal basis of the subspace spanned by can be obtained bywhere The projection vector of can be computed by
Obviously, based on Lemma 3, it is straightforward to incrementally estimate the projection vector. Notably, the dimensionality of is smaller than that of in the case of . In fact, based on the Gram-Schmidt orthogonalization process, let denote the projection vector of on the orthonormal basis , then . The proof of Lemma 3 is provided in Appendix C.
Algorithm 4. An online algorithm for incrementally finding the orthonormal basis and the projection vectors.
Step 0 (initialization). For the time N=1, we found a sample . We suppose . Let the set , , and .
Step 1. Calculate for a new sample according to where and have the same definition as in Lemma 3.
Step 2. If , then and return to Step 1.
Step 3. If , then and update using (12). Finally, update using (14) and return to Step 1. Obviously, if we map all the samples of into the kernel space and get the dataset , then the mapped samples are linearly independent. Furthermore, we can get an orthonormal basis based on and (see (9) or (11)). In fact, taking into account the actual calculation error, we usually use a very small threshold value to decide performing Step 2 or Step 3 in Algorithm 4. In other words, if , we perform step 2; otherwise, we perform Step 3.
4. Incremental Learning of KPCA
In this section, we will outline the incremental learning method of KPCA based on the incremental version of PCA (IPCA) proposed by Hall et al. . The key difference between our method and Hall et al.’s method is that the dimensionality of the projection vector in our case is not constant; hence we further adapt IPCA to our aim and address this limitation.
4.1. Description of IKPCA
Given a sample set and its corresponding projection vector set (see the Section 3), we assume we have already built a set of eigenvectors and its corresponding matrix with the set as an input. Note that we have where denotes the dimension of . Let denote the corresponding matrix of eigenvalues and the mean vector. Incremental building requires updating these eigenvectors when a new input sample is obtained, which is the projection vector of . Obviously, the dimensionality of may be larger than that of (see (12)). When their dimensionalities are identical, we denote , otherwise, . Firstly, we update the mean:where means adding one zero to the original vector . Then we update the set of eigenvectors by adding a new vector and applying a rotational transformation. In order to do this, we first compute the orthogonal residual vector:where is computed by
Subsequently, we normalize to obtain for and otherwise. The new matrix of eigenvectors is computed bywhere is a rotation matrix with dimension p+1. is the solution of the eigenproblem of the following form [18, 36]:
We compose aswhere and .
Broadly, the procedure of our incremental method is similar to IPCA presented by Hall et al. . Only under the condition , we add one zero (zero vector) to the corresponding variant (matrix).
4.2. Framework of TP-IKPCA
Based on the analysis in Section 3 and in Section 4.1, we present the flowchart of TP-IKPCA in Figure 2. Here, denotes the input sample set and represents its corresponding mapped set by an implicit nonlinear mapping . Firstly, an orthonormal basis of the subspace spanned by (see (9) or (11)) and the corresponding projection vectors set (see (10)) are obtained. Subsequently, a set of eigenvectors of are built. Then, for a new coming sample , its mapped sample is used to update the orthonormal basis and its projection vector is computed using Algorithm 4. Finally, based on the IKPCA described in Section 4.1, the eigenvector set is updated using . It can be seen from Figure 2 that TP-IKPCA includes two main steps. The first step is based on incremental learning algorithm that represents the mapped data using an explicit form (see Algorithm 4). The second step incrementally computes the principal components of using IKPCA (see Section 4.1).
The dimensions ,, and of the input, kernel, and projection spaces, respectively, generally satisfy the following inequalities: and (see Section 1). Here, we need to focus on the meaning of . In this paper, p denotes the number of the eigenvectors derived from the covariance matrix of the projection vector set . In other words, p is the number of the principal components of . We note that the number of the principal components, that is p, should be also smaller than the dimension of the projection space. In summary, we have .
4.3. Complexity Analysis of TP-IKPCA
Suppose that given the current sample set , the dimension of the subspace spanned by is . When a new sample arrives, TP-IKPCA first conducts Algorithm 4 where the most time-consuming step is to compute appearing in (11)-(14). The time complexity for this step is since is an orthogonal transformation matrix and is an N-dimensional vector. Then, TP-IKPCA incrementally computes the principal components using IKPCA (see Section 4.1). In this step, the computation of the eigendecomposition of matrix in (20) is most time-consuming. We can define the upper bound for its time complexity as since we have . Considering the above two steps, the overall time complexity of TP-IKPCA in worst case can be estimated as . In fact, is usually tenable [26, 37], which makes TP-IKPCA convenient for processing large-scale or online datasets. Meanwhile, TP-IKPCA requires storing an orthogonal transformation matrix (see (9)), which only involves a space complexity of . Especially in the testing stage, obtaining the principal component of a new sample only needs to calculate the kernel functions between the new sample and the old training samples that compose the orthonormal basis (see (21)), which results in improving the computing speed.
We evaluated and compared the performance of TP-IKPCA on synthetic and real datasets with several typical KPCA-based approaches in terms of accuracy and time complexity. The comparison methods include conventional batch mode KPCA, incremental KPCA [14, 15] with reduced-set (IKPCA-RS), and the recently developed incremental KPCA  based on rank one updates to the eigendecomposition of kernel matrix (INKPCA). The time complexity can be captured by two aspects: time required for learning the training data; r, i.e., the number of orthonormal basis elements of the subspace spanned by the mapping training data. Usually, a smaller indicates a reduced time complexity of TP-IKPCA (see Section 4.3). At the same time, we also used two different measures to evaluate the accuracy of TP-IKPCA. The first one is the correlation coefficient between two corresponding principal components (PCs) of TP-IKPCA and KPCA. Since KPCA is performed using all training data in a batch learning model, the PCs of KPCA are the target that TP-IKPCA needs to capture. Ideally, the PCs of TP-IKPCA should be identical with those of KPCA. Therefore, the correlation coefficient between two corresponding PCs is evaluated to show how accurate is TP-IKPCA in comparison to batch KPCA. Specifically, let denote the l-th PC of KPCA after learning all of the samples and is the l-th PC of TP-IKPCA after learning samples, we define the correlation coefficient (corr) by The specific computation in (22) can be deduced from (5) and (8). The second measure is to compare the effectiveness of TP-IKPCA and KPCA. Here, for two-dimensional synthesized datasets, we adopt the contour lines of PCs as an evaluation measure. For real datasets, we adopt the practical denoising effect.
5.1. Synthesized Data
In this experiment, we use two-dimensional nonlinear synthetized data to evaluate the accuracy and memory space efficiency of KPCA, IKPCA-RS, INKPCA, and our proposed TP-IKPCA. The data is generated by:where denotes the standard Gaussian distribution and is the uniform distribution in . In this experiment, the kernel function is the polynomial kernel form, namely, . Here, the parameter m=2.
The contour lines of the first three PCs obtained by each method for N=500 training samples are illustrated in Figure 3 (Top to bottom: KPCA, TP-IKPCA, IKPCA-RS, and INKPCA), where red dots represent samples and green lines represent the contour lines of the corresponding PCs. denotes the eigenvalue of the corresponding PC. Figure 3 shows no visually discernible differences between our method and the 3 comparison methods. In other words, their contour lines are very similar. Furthermore, the differences in eigenvalues are also very small. Therefore, it can be concluded that all of the results derived from the three incremental algorithms: IKPCA-RS, INKPCA, and our proposed TP-IKPCA, follow the ground truth results closely.
Figure 4 shows the evolution curves of the correlation coefficients between the first three PCs obtained by TP-IKPCA, IKPCA-RS, INKPCA, and their corresponding PCs obtained by KPCA when increasing the number of training samples. Figure 4 shows for different incremental algorithms that the resulting correlation coefficient gradually converges to 1 as the number of training samples increases. It indicates that the PCs obtained by TP-IKPCA, IKPCA-RS, and INKPCA gradually approximate those obtained by batch mode KPCA with high accuracy.
Table 1 shows the average training time for learning from 500 training samples and testing time for extracting features from 100 testing samples (in seconds). We repeated this procedure 20 times and reported the averaged training and testing times as well as the corresponding standard deviations.
From Table 1, we can clearly see that when using the synthesized data, all of the three incremental KPCA algorithms have faster training speed than KPCA due to low-dimensional features as well as simple distribution of this data. Among these incremental variants, our proposed TP-IKPCA leads to fastest training speed. From the perspective of testing speed, our method can perform prediction in the least time. This can be explained by the fact that the learning time of TP-IKPCA depends on the size of orthonormal basis r (=9) while KPCA and INKPCA both depend on the total number of training samples (=500 in our case). IKPCA-RS also leads to fast testing speed since it uses a reduced set. However, it performs slower than our method in training speed since seeking a set of approximate preimages when new samples arrive using certain optimization techniques or fixed-point iteration is time-consuming.
In what follows, we design two experiments to investigate the behavior of our algorithm when the number of training samples increases. Firstly, in Figure 5, we plot the variation curve of the number of basis, i.e., r, against the number of training samples (N). From this result, we notice that in the beginning, r gradually increases with . However, after , r stops increasing. This shows that the mapped data have strong linear correlation in the kernel space. Hence, although the number of training samples continues to increase, the number of basis remains stable. More importantly, the computational complexity of TP-IKPCA becomes a constant when , which is a significant improvement over standard KPCA (in the order of ) —particularly when the number of training samples becomes very large.
Then, we compute the acceleration ratio which represents the ratio of the time consumed by KPCA to extract features from 100 test samples to the time consumed by TP-IKPCA. The resulting variation of acceleration ratio for testing speed with respect to the number of training samples is shown in Figure 6. Obviously, a larger ratio indicates a faster test speed of TP-IKPCA compared with KPCA. Figure 6 also shows that the larger the number of training samples, the larger the ratio, implying that TP-IKPCA can significantly improve test speed compared with KPCA.
5.2. MNIST Data
In this section, we consider an image processing application where we process the MNIST database of handwritten digits (http://yann.lecun.com/exdb/mnist/). The database consists of handwritten digits from 0 to 9. Each digit set includes training set and testing set. Each digit is represented by a 28-by-28 image. In order to evaluate the performance of different approaches, we carried out image denoising experiment to the even number. Firstly, for each digit, we randomly select 500 training samples and 100 testing samples and then add the Gaussian noise and the salt-and-pepper noise, respectively, to the testing images. The mean and variance for the Gaussian noise are 0 and 0.2, respectively, while the level of the salt-and-pepper noise is 0.3. Next, we estimate the first sixteen principal components using KPCA, IKPCA-RS, INKPCA, and our proposed TP-IKPCA, respectively. Finally, we perform denoising experiments on all corrupted testing samples and reconstruct them using the reconstructive scheme presented in . In the experiment, we use the Gaussian function as the kernel function in each method and set the bandwidth .
Figure 7 shows some restoration results from the corrupted images by conducting KPCA, TP-IKPCA, IKPCA-RS and INKPCA, respectively. It can be seen from these figures that all of these methods can eliminate noise and reconstruct the images well. Therefore, we can draw the conclusion that different incremental KPCA algorithms can closely approximate batch mode KPCA in reconstruction performance with high accuracy.
Figure 8 shows the evolution curves of the correlation coefficients between the first three PCs by TP-IKPCA, IKPCA-RS, INKPCA, and their corresponding ones by KPCA when the number of training samples increases, where the horizontal axis denotes the number of training samples and the vertical axis is the correlation coefficient computed by (22). Figure 8 shows that all incremental KPCA algorithms lead to good approximation accuracies for the first three PCs since the correlation coefficients gradually converge to 1 as the number of training samples increases. The results indicate that incremental KPCA algorithms can gradually capture PCs in real high-dimensional datasets with good approximation accuracy.
We also display in Table 2 the average training and testing times (in seconds) for training 500 samples and extracting features from 100 testing samples, respectively. We repeated this process 20 times and reported the average values and the corresponding standard deviations.
Firstly, we analyze the training results shown in Table 2. We can derive the following observations: IKPCA-RS consumes much more time in training than other approaches, including batch model KPCA. These results differ from those obtained when using the synthesized data where IKPCA-RS ran faster than batch mode KPCA. Through inspecting the experiments, we found that IKPCA-RS iterates reduced-set expansions many times when computing the preimages for compression. As a result, when handling data with a large number of features (=784 in our case), the calculation of preimages increases the computational load. This observation is in line with the conclusion drawn in . INKPCA needs to apply rank one update to iteratively calculate eigenvalues and eigenvectors associated with kernel matrix when new data arrive. Although INKPCA consumes more time than batch mode KPCA, it does not need to store the whole kernel matrix in memory and has the advantage of better handling massive data where KPCA rapidly becomes infeasible. As for our proposed TP-IKPCA, it runs much faster than other incremental algorithms, including IKPCA-RS and INKPCA, excluding a few cases where our algorithm may be slower than batch mode KPCA. This can be explained by the computational complexity of TP-IKPCA of the order of . In this experiment, the linear correlation between the mapped digital images is very weak, which results in a very large and even approximates the number of training samples.
From the testing results shown in Table 2, we can conclude the following: The testing speeds of KPCA and INKPCA are similar since INKPCA cannot select a few yet important samples from the whole data but still makes use of all available samples when calculating the projections of new data. Therefore, their speed is proportional to the total number of the training samples. Regardless of specific digit, the testing speed of IKPCA-RS and TP-IKPCA are much faster than that of KPCA and INKPCA since both methods are able to reduce the number of samples used for kernel evaluation although they adopt different strategies. The testing time of TP-IKPCA is proportional to the size of basis . In this experiment, the size of basis is smaller than the number of training samples, thus leading to an improvement in the test speed compared with KPCA. Considering the training time, our proposed TP-IKPCA is obviously preferred over IKPCA-RS.
In what follows, we gradually increase the number of training samples and summarize the training and testing time required by TP-IKPCA and standard KPCA. We find from extensive experiments that the computational superiority of TP-IKPCA over KPCA increases with the number of training samples. Taking the experiments on digit “0” as an example, we repeated this evaluation 20 times and recorded the resulting training and testing time (in seconds) required by TP-IKPCA and KPCA under different training sample size . Finally, the averaged time and the standard deviation are shown in Table 3 where the training ratio in Table 3 denotes the ratio of training time of KPCA to that of TP-IKPCA, given a total number of samples. In a similar way, the testing ratio represents the ratio of testing time of KPCA to that of TP-IKPCA when extracting features from 100 test samples.
Based on Table 3, we can make the following observations: As the number of training samples continues to increase, the number of basis tends to increase as well. However, the increasing speed of gradually decreases, which indicates that the larger the size of the training set, the stronger the correlation between the samples. With the increasing of the training set size N, the training time of both KPCA and TP-IKPCA also increases gradually. However, the increasing speed of TP-IKPCA is much slower than that of KPCA. The reason lies in that the time complexity of TP-IKPCA has a close relationship with the number of basis while that of KPCA depends on the total number of training samples . As N increases, the increasing speed of progressively decreases since most of the correlation structure among data has been revealed. We also derive a similar conclusion from the ratio’s evolution in the training stage with respect to the changes of the number of training samples, where the ratio gradually increases with in the training stage. As increases, the testing time of KPCA and TP-IKPCA both increase gradually. However, the increasing speed of TP-IKPCA is much slower than that of KPCA, which is also reflected by the ratio’s evolution in the testing stage. The reason is that the test speed of TP-IKPCA is closely related to the number of basis and the increasing speed of gradually becomes slower with the increasing of . Based on the above analysis, we conclude that TP-IKPCA does significantly improve the computational complexity of KPCA. Moreover, TP-IKPCA can deal with dynamic dataset due to its “incremental” nature.
In this paper, we proposed a novel incremental feature extraction method termed as TP-IKPCA which endowed KPCA with the capability of handling dynamic or large-scale datasets. The proposed TP-IKPCA differs from the existing incremental approaches in providing an explicit form of the mapped data and the updating process of KPCs is also performed in an explicit space. Specifically, TP-IKPCA is implemented in two phases. First, an incremental algorithm is given to explicitly project the mapped samples in the kernel space. Second, we employed the existing incremental method of PCA to capture KPCs based on the explicit data in the projection space. The computational complexity of TP-IKPCA has a close relationship with the size of basis of the subspace spanned by the mapped training samples. Usually, r is much smaller than the number of training samples N, and thus TP-IKPCA can greatly improve the computational complexity of KPCA. In the case of large-scale or online dataset, the computational superiority of our approach is remarkable. Experimental results on synthetic and real datasets demonstrate that TP-IKPCA can significantly improve the time complexity of KPCA while preserving a high accuracy as standard KPCA. In comparison with two incremental KPCA algorithms, TP-IKPCA also illustrates superiority in terms of training and testing speed.
TP-IKPCA can be utilized in any application where KPCA needs to be conducted, especially when training data is of large scale, or can only be collected one by one, where the conventional batch-based KPCA cannot be applied. The idea of this study can be extended to other kernel-based methods, such as Kernel Fisher discriminant analysis (KFDA), Kernel independent component analysis (KICA), and so on.
A. The Proof of Theorem 1
Let denote an orthonormal basis of the subspace spanned by . Equation (1) can be written aswhere is the projection matrix, denotes the orthonormal basis matrix, and is the mapped sample matrix. The covariance matrix of (see (2)) can be expressed using the following formula:where is a matrix and its element can be written as
Let , then is the covariance matrix of . We have
For the relationship of the eigenvalues and the corresponding eigenvector between and , we give a derivation as follows.
Lemma A.1. Let be the l-th nonzero eigenvalue of and be the eigenvector, then is the eigenvector of and its eigenvalue is . Scilicet, .
Proof. Based on the above definitions, we have and. On the other hand, is established. So, , thus, . Hence, , Scilicet, .
Lemma A.2. Let be the l-th nonzero eigenvalue of and be the corresponding eigenvector, then is the eigenvector of and its eigenvalue is . Scilicet, .
Proof. Based on Lemmas A.1 and A.2, we know the nonzero eigenvalue of is the same with that of and the corresponding eigenvector satisfies . Hence, the eigendecomposition of can be converted into the corresponding process of . And because , so (8) still holds after the eigenvector is unitized. So, Theorem 1 is proven.
B. The Proof of Lemma 2
Let and be the corresponding matrix which are, respectively, from the eigenvector and the eigenvalue of the kernel matrix . We have . Let and ; we have . Thus = = . So, Lemma 2 is proven.
C. The Proof of Lemma 3
If is the orthonormal basis of the subspace spanned by , then . Based (9), we have
So, conclusion (1) in Lemma 3 is proven. Subsequently, if , this means cannot be linearly expressed by . Based the Gram-Schmidt orthogonalization process, can be determined using the following formula:
In our manuscript, we used two datasets to support the findings of our study. One dataset is the synthetic toy data, which can be generated by the following way: = ( 2 2 + 1) +0.2, (0,1), where denotes the standard Gaussian distribution and is the uniform distribution in . The second data set is the MNIST database of handwritten digits, which are available on this site: http://yann.lecun.com/exdb/mnist.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
This work was supported in part by National Natural Science Foundation of China (Grants nos. 61773244, 61373079, and 61572344), National Institutes of Health in USA (AG041721, MH107815, EB006733, EB008374, and EB009634), and Provincial Natural Science Foundation of Shanxi in China (2018JM4018).
F. Vojtěch and H. Václav, “Greedy algorithm for a training set reduction in the kernel methods,” in Proceedings of the 10th International Conference on Computer Analysis of Image and Patterns, vol. 2756 of Lecture Notes in Comput. Sci., pp. 426–433, Springer, Groningen, Netherlands, August 2003.View at: Publisher Site | Google Scholar | MathSciNet
F. Vojtech, Optimization Algorithms for Kernel Methods [Ph.D. Dissertation], Center for Machine Perception, Czech Technical University, Prague, Czech Republic, 2005.
B.-J. Kim, “Active visual learning and recognition using incremental kernel PCA,” in Proceedings of the 18th Australian Joint Conference on Advances in Artificial Intelligence AI’05, vol. 3809 of Lecture Notes in Comput. Sci., pp. 585–592, Springer, 2005.View at: Publisher Site | Google Scholar | MathSciNet
S. Kimura, S. Ozawa, and S. Abe, “Incremental Kernel PCA for online learning of feature space,” in Proceedings of the 2005 International Conference on Computational Intelligence for Modelling, Control and Automation, vol. 1, pp. 595–600, Vienna, Austria, November 2005.View at: Google Scholar
O. Seiichi, Y. Takeuchi, and A. Shigeo, “A fast incremental kernel principal component analysis for online feature extraction,” in Proceedings of the Pacific Rim International Conference on Trends in Artificial Intelligence, vol. 6230 of Lecture Notes in Computer Science, pp. 487–497, Springer, 2010.View at: Publisher Site | Google Scholar
T. Takaomi and O. Seiichi, “A fast incremental kernel principal component analysis for learning stream of data chunks,” in Proceedings of the 2011 International Joint Conference on Neural Networks (IJCNN 2011 - San Jose), pp. 2881–2888, San Jose, CA, USA, July 2011.View at: Publisher Site | Google Scholar
G. Baudat and F. Anouar, “Kernel-based methods and function approximation,” in Proceedings of the International Joint Conference on Neural Networks (IJCNN'01), pp. 1244–1249, Washington, DC, USA, July 2001.View at: Google Scholar
O. Seiichi, P. Shaoning, and K. Nikola, “A modified incremental principal component analysis for on-line learning of feature space and classifier,” in Proceedings of the 8th Pacific Rim International Conference on Artificial Intelligence, pp. 231–240, Auckland, New Zealand, 2004.View at: Google Scholar
R. A. Horn and C. R. Johnson, Matrix Analysis, Cambridge University Press, New York, NY, USA, 2013.View at: MathSciNet
S. B. Mike, B. Scholkopf, and A. J. Smola, “Kernel PCA and denoising in feature space,” in Advances in Neural Information Processing System, pp. 524–536, MTI press, Cambridge, UK, 1999.View at: Google Scholar