Abstract

We introduce multiscale wavelet kernels to kernel principal component analysis (KPCA) to narrow down the search for parameters required in the calculation of a kernel matrix. This new methodology incorporates multiscale methods into KPCA for transforming multiscale data. In order to illustrate the application of our proposed method and to investigate the robustness of the wavelet kernel in KPCA under different levels of the signal-to-noise ratio and different types of wavelet kernel, we study a set of two-class clustered simulation data. We show that WKPCA is an effective feature extraction method for transforming a variety of multidimensional clustered data into data with a higher level of linearity among the data attributes, which improves the accuracy of simple linear classifiers. Based on the analysis of the simulation data sets, we observe that multiscale translation invariant wavelet kernels for KPCA have enhanced performance in feature extraction. The application of the proposed method to real data is also addressed.

1. Introduction

The majority of the techniques developed in the field of computational mathematics and statistics for modeling multivariate data have focused on detecting or explaining linear relationships among the variables, as in principal component analysis (PCA) [1]. However, in real-world applications the property of linearity is a rather special case, and most of the captured behaviors of data are nonlinear. In data classification, a possible way to handle nonlinearly separable problems is to use a non-linear classifier [2, 3]. In this approach a classifier constructs an underlying objective function using some selected components of the original input data. An alternative approach, presented in this paper, is to map the data from the original input space into a feature space through kernel-based methods [4, 5].

PCA is often used for feature extraction in high dimensional data classification problems. The objective of PCA is to map the data attributes into a new feature space that contains better, that is, more linearly separable, features than those in the original input space. As the standard PCA is linear in nature, the projections onto the principal component space do not always yield meaningful results for classification purposes. To address this problem, various kernel-based methods have been applied successfully in machine learning and data analysis (e.g., [6–10]). The introduction of a kernel allows working implicitly in some extended feature space, while doing all computations in the original input space.

Recently, wavelet kernels have been successfully used in support vector machine (SVM) learning to classify data because of their high flexibility [9, 11]. The Gaussian wavelet kernel, one of the most common kernels used in practice, has been used as either a dot-product kernel or a translation invariant kernel. Besides it, many other wavelet kernels are commonly used, including the cubic B-spline wavelet kernel, the Mexican hat wavelet kernel, and the Morlet wavelet kernel. Although kernel-based classification methods enable capturing the nonlinearity of the data attributes in the feature space, they are usually sensitive to the choice of parameters of a given kernel [6]. Similarly, in kernel PCA (KPCA) [12, 13], optimization of kernel parameters is difficult. The search for hyperparameters via cross-validation methods can be computationally expensive because of the many possible choices of parameter values [2]. This calls for the construction of a new type of kernel that performs well as a feature extraction method in KPCA.

Much current research has been focused on the development of multiscale kernel methods, for example, [14–18]. These methods have been used in non-linear classification and regression problems. For instance, [19] proposed a multiscale kernel method in SVM to improve the Gaussian radial basis function (RBF) by combining several terms of the RBF kernel at different scales. In [19], evolutionary strategies are used for searching for appropriate values of the kernel parameters, but they are very time consuming. In [20], a multiscale kernel method used in SVM improved classification accuracy over the traditional single-scale SVM. Multiscale kernel methods have also been proposed for support vector regression, for example, [21–24].

Our work is different from those discussed above as we focus on the construction of multiscale wavelet kernels for KPCA in data classification. We propose to use the multiscale kernels in the process of feature extraction rather than in the classification step. This innovation aims at extracting a set of more linearly separable features so that a simple classifier can be applied for classification. Our method incorporates multiscale methods into KPCA, making wavelet kernel PCA (WKPCA) perform well in extracting data features. We do not search for optimal values of the kernel parameters of a given kernel, which are often obtained by cross-validation methods. Instead, we focus on constructing multiscale wavelet kernels that are parameter free. We aim to investigate these kernels and to see how each of them performs in multiscale data classification.

This paper is organized as follows. In Section 2 we provide a brief description of KPCA as a feature extraction method. Sections 3 and 4 discuss the methods of constructing multiscale wavelet kernels. In Section 5 we discuss the computational aspects of multiscale wavelet kernels. In Sections 6 and 7 we discuss the results of the simulation experiments and an application to real data. In Section 8 we report our conclusions and outline future work.

2. Kernel PCA

KPCA aims, for a given data set $\{\mathbf{x}_1,\ldots,\mathbf{x}_n : \mathbf{x}_j\in\mathbb{R}^d\ \text{for all } j\}$, to capture the nonlinear relationships among the data by mapping the original observations $\mathbf{x}_1,\ldots,\mathbf{x}_n\in\mathbb{R}^d$ into a feature space spanned by the column vectors $\Phi(\mathbf{x}_1),\ldots,\Phi(\mathbf{x}_n)$, where the function $\Phi(\cdot)$ maps $\mathbf{x}_i$ into the feature space, for each $i=1,\ldots,n$ [2, 3, 13]. The map $\Phi(\cdot)$ is usually determined by the Gaussian function, by a polynomial function, or by a reproducing kernel in a Hilbert space. Here, we focus on wavelet kernels that result in positive semidefinite kernel matrices. Assuming that the data $\Phi(\mathbf{x}_1),\ldots,\Phi(\mathbf{x}_n)$ in the feature space are centered (this assumption will be relaxed later), and viewing $\Phi(\mathbf{x}_1),\ldots,\Phi(\mathbf{x}_n)$ as independent random vectors, for $\mathbf{x}_i\in\mathbb{R}^d$, the sample covariance matrix of these random vectors can be written as (see [2])
$$C=\frac{1}{n}\sum_{j=1}^{n}\Phi(\mathbf{x}_j)\Phi(\mathbf{x}_j)^{\top}. \qquad (1)$$
The aim of PCA applied to the covariance matrix $C$ is to find the eigenvalues $\lambda$ and eigenvectors $\mathbf{V}$ of $C$. The eigenvectors $\mathbf{V}$ are chosen to lie in the span of $\Phi(\mathbf{x}_1),\ldots,\Phi(\mathbf{x}_n)$, so that the calculation of the eigenvalues and eigenvectors can be done through the so-called kernel matrix $\mathbf{K}$, which is defined by
$$k(\mathbf{x}_i,\mathbf{x}_j)=\Phi(\mathbf{x}_i)^{\top}\Phi(\mathbf{x}_j),\quad 1\le i,j\le n. \qquad (2)$$
The objective of the principal component extraction is to project the transformed observation $\Phi(\mathbf{x})$ onto the linear space spanned by the normalized eigenvectors $\tilde{\mathbf{c}}_l$, for $l=1,\ldots,n$. As we focus only on Mercer kernels, $C$ is a positive semidefinite matrix and all its eigenvalues are nonnegative. Thus, the coefficients of the projected vector $\Phi(\mathbf{x})$ are given by
$$\Phi(\mathbf{x})^{\top}\tilde{\mathbf{c}}_l=\sum_{i=1}^{n}\tilde{c}_{li}\,k(\mathbf{x}_i,\mathbf{x}), \qquad (3)$$
where $\tilde{\mathbf{c}}_l=(\tilde{c}_{l1},\ldots,\tilde{c}_{ln})^{\top}$ and $l=1,\ldots,n$.

In the above derivation we assumed that $\Phi(\mathbf{x}_1),\ldots,\Phi(\mathbf{x}_n)$ are centered. In practice, one needs to relax this condition. Therefore, instead of using the kernel matrix $\mathbf{K}$, one should work with the centered version of $\mathbf{K}$, given by
$$\tilde{\mathbf{K}}=\mathbf{K}-\mathbf{1}_n\mathbf{K}-\mathbf{K}\mathbf{1}_n+\mathbf{1}_n\mathbf{K}\mathbf{1}_n, \qquad (4)$$
where $\mathbf{1}_n$ is the matrix with entries $(\mathbf{1}_n)_{ij}=1/n$, for $1\le i,j\le n$. The details of the derivation of $\tilde{\mathbf{K}}$ can be found in [2].
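To make the computations in (1)–(4) concrete, the following sketch (a minimal NumPy illustration under our notation, not a reference implementation; the function name and the eigenvalue clipping are our own choices) centers a precomputed Mercer kernel matrix as in (4) and returns the projections (3) of the training points onto the leading kernel principal components.

```python
import numpy as np

def kpca_features(K, n_components):
    """Kernel PCA from a precomputed Mercer kernel matrix K (n x n).

    Returns the projections of the training points onto the first
    n_components kernel principal components, cf. (3) and (4).
    """
    n = K.shape[0]
    one_n = np.full((n, n), 1.0 / n)
    # Centered kernel matrix (4): K~ = K - 1_n K - K 1_n + 1_n K 1_n
    Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n
    # Eigen-decomposition; np.linalg.eigh returns eigenvalues in ascending order
    eigvals, eigvecs = np.linalg.eigh(Kc)
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
    eigvals = np.clip(eigvals, 0.0, None)   # guard against round-off negatives
    # Normalizing the expansion coefficients so that each feature-space
    # eigenvector has unit norm, the projection of x_p onto component l
    # equals sqrt(mu_l) * (eigvec_l)_p, where mu_l is an eigenvalue of K~.
    return eigvecs[:, :n_components] * np.sqrt(eigvals[:n_components])
```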

3. Multiscale Dot-Product Wavelet Kernel Construction

In this section a method of constructing a dot-product type wavelet kernel using multiple mapping functions is proposed. For a single-scale kernel (i.e., only one translation factor), the performance of KPCA in data classification may be affected both by the choice of kernel and by the choice of values of the kernel parameters. The practical solution is first to investigate which choices of kernel are appropriate for the data, and then to search for suitable kernel parameters of the selected kernel. When the data are multiscale, for example, exhibiting nonstationarity in the mean or in the variance, the use of single-scale KPCA may not be a good choice as the feature extraction method in data classification due to the complex structure of the data.

The construction of kernels based on multiple mapping functions provides a framework for extending a single-scale kernel to a multiscale kernel in KPCA. Let $\phi_i:\mathbf{x}\in\mathbb{R}^d\mapsto\phi_i(\mathbf{x})\in\mathcal{H}_i$, $i\in\{1,2,\ldots,g\}$, be a nonlinear map and $\mathcal{H}_i$ the respective feature space, where $\mathbf{x}$ is a column vector and $g$ is the total number of mapping functions. From $\phi_i$, we construct another mapping function $\hat{\phi}_i:\mathbf{x}\in\mathbb{R}^d\to\mathcal{H}$ defined as
$$\hat{\phi}_i(\mathbf{x})=\bigl(\underbrace{\mathbf{0},\ldots,\mathbf{0}}_{1,\ldots,i-1},\;\phi_i(\mathbf{x}),\;\underbrace{\mathbf{0},\ldots,\mathbf{0}}_{i+1,\ldots,g}\bigr), \qquad (5)$$
where $\mathcal{H}$ is the Hilbert feature space given by the direct sum of the $\mathcal{H}_i$ and $\hat{\phi}_i(\mathbf{x})$ is a column vector with $dg$ entries for a given $\mathbf{x}$. Define a new map $\Phi$ based on $\hat{\phi}_i(\mathbf{x})$ as $\Phi(\mathbf{x})=(\hat{\phi}_1(\mathbf{x}),\ldots,\hat{\phi}_g(\mathbf{x}))$. In this case, $\Phi$ maps $\mathbf{x}$ into a $dg\times g$ two-dimensional feature space. Using the map $\Phi$ as the feature map in KPCA, the original data set $\{\mathbf{x}_1,\ldots,\mathbf{x}_n\}\subset\mathbb{R}^d$ is mapped into $\Phi=(\Phi(\mathbf{x}_1),\ldots,\Phi(\mathbf{x}_n))$. As a result, $\Phi$ has $ng$ columns, a number that is usually very large. The high dimension of the feature map causes an intensive computation problem in KPCA. One solution to this problem is to reduce the dimension of $\Phi$. Instead of arranging $\Phi(\mathbf{x})$ in a matrix, we arrange it into a vector, replacing $\Phi(\mathbf{x})$ by $\Phi(\mathbf{x})=\sum_{i=1}^{g}\alpha_i\hat{\phi}_i(\mathbf{x})$, where $\alpha_i$ is a positive real weight coefficient applied to the map $\hat{\phi}_i(\mathbf{x})$. For simplicity, $\alpha_i$ can be chosen as $1/g$. Without loss of generality, we assume that $\Phi(\mathbf{x})$ has zero mean. For $\mathbf{x},\mathbf{y}\in\mathbb{R}^d$, using $\Phi(\mathbf{x})$ as the feature map, the kernel function in KPCA becomes
$$k(\mathbf{x},\mathbf{y})=\Bigl(\sum_{i=1}^{g}\alpha_i\hat{\phi}_i(\mathbf{x})\Bigr)^{\!\top}\Bigl(\sum_{i=1}^{g}\alpha_i\hat{\phi}_i(\mathbf{y})\Bigr)=\sum_{i=1}^{g}\alpha_i\,\phi_i(\mathbf{x})^{\top}\phi_i(\mathbf{y}), \qquad (6)$$
due to the fact that $\hat{\phi}_i(\mathbf{x})^{\top}\hat{\phi}_j(\mathbf{y})=0$ for $i\ne j$. If we denote $k_i(\mathbf{x},\mathbf{y})=\phi_i(\mathbf{x})^{\top}\phi_i(\mathbf{y})$, then $k(\mathbf{x},\mathbf{y})=\sum_{i=1}^{g}\alpha_i k_i(\mathbf{x},\mathbf{y})$. Therefore, a single-scale kernel is just the special case of a kernel with multiple mapping functions obtained by taking $g=1$ and $\alpha_i=1$.

Using the dilated and translated versions of a mother wavelet function $\psi(\cdot)$, with dilation factor $a_j$ and translation factor $b_k$, for $j=0,\ldots,J-1$ and $0\le k\le N$, as the set of basis functions of the mapping functions $\phi_i(\mathbf{x})$, and taking $\alpha_i=1/a_j$, the kernel function in (6) can be rewritten as
$$k(\mathbf{x},\mathbf{y})=\sum_{i=1}^{d}\sum_{j=0}^{J-1}\sum_{k=0}^{N}\frac{1}{a_j}\,\psi\!\left(\frac{x_i-b_k}{a_j}\right)\psi\!\left(\frac{y_i-b_k}{a_j}\right). \qquad (7)$$
We call the kernel function in (7) the multiscale dot-product wavelet kernel (MDWK). The MDWK is the special case of the dot-product wavelet kernel in which the dilated and translated versions of a mother wavelet function are chosen as the multiple mapping functions. In kernel-based methods, the constructed kernel must be a Mercer kernel, that is, it must have a positive semidefinite kernel matrix [3].
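As an illustration of (7), the sketch below evaluates the MDWK for a pair of observations; the Mexican hat mother wavelet and the grids of $a_j$ and $b_k$ are illustrative defaults only (the grids actually used in our experiments are specified in Section 5, where $b_k$ is tied to $a_j$).

```python
import numpy as np

def mexican_hat(u):
    """Mexican hat mother wavelet: psi(u) = (1 - u^2) exp(-u^2 / 2)."""
    return (1.0 - u ** 2) * np.exp(-u ** 2 / 2.0)

def mdwk(x, y, psi=mexican_hat, dilations=None, translations=None):
    """Multiscale dot-product wavelet kernel (7) for two vectors x, y in R^d."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    if dilations is None:
        dilations = 2.0 ** (0.25 * np.arange(6))   # a_j = 2^{0.25 j}, j = 0,...,5
    if translations is None:
        translations = 0.5 * np.arange(11)         # illustrative b_k grid
    value = 0.0
    for a_j in dilations:
        for b_k in translations:
            # sum over coordinates of (1/a_j) psi((x_i-b_k)/a_j) psi((y_i-b_k)/a_j)
            value += np.sum(psi((x - b_k) / a_j) * psi((y - b_k) / a_j)) / a_j
    return value
```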

Theorem 1. Let $\psi(x)$ be a mother wavelet function and let $a_j,b_k\in\mathbb{R}^{+}$ denote the dilation and translation factors, respectively. Then, for any $\mathbf{x},\mathbf{y}\in\mathbb{R}^d$ and a finest resolution level $J$, the dot-product wavelet kernel function
$$k(\mathbf{x},\mathbf{y})=\sum_{i=1}^{d}\sum_{j=0}^{J-1}\sum_{k=0}^{N}\frac{1}{a_j}\,\psi\!\left(\frac{x_i-b_k}{a_j}\right)\psi\!\left(\frac{y_i-b_k}{a_j}\right) \qquad (8)$$
is a Mercer kernel defined on $\mathbb{R}^d\times\mathbb{R}^d$.

The proof of this theorem is provided in the Appendix.

As a special case, we obtain the single-scale dot-product wavelet kernel (SDWK)
$$k(\mathbf{x},\mathbf{y})=\sum_{i=1}^{d}\psi\!\left(\frac{x_i-b}{a}\right)\psi\!\left(\frac{y_i-b}{a}\right), \qquad (9)$$
where $a\in\mathbb{R}^{+}$ and $b\in\mathbb{R}$.

4. Multiscale Translation Invariant Wavelet Kernel Construction

Another type of single-scale kernel is a distance-based kernel called the translation invariant (TI) kernel [11]. The TI kernel is defined as $k(\mathbf{x},\mathbf{y})=\Phi(\mathbf{x}-\mathbf{y})$, where $\mathbf{x},\mathbf{y}\in\mathbb{R}^d$. However, for a TI kernel to be used as a kernel in KPCA, again, one has to show that the kernel matrix constructed from the TI kernel is positive semidefinite. To this end, consider single-scale TI wavelet kernels of the form $k_j(\mathbf{x},\mathbf{y})=\prod_{i=1}^{d}\psi_j\!\left(\frac{x_i-y_i}{a}\right)$, where $a$ is a single-scale parameter. A kernel defined as
$$k(\mathbf{x},\mathbf{y})=\frac{1}{g}\sum_{j=1}^{g}k_j(\mathbf{x},\mathbf{y})=\frac{1}{g}\sum_{j=1}^{g}\prod_{i=1}^{d}\psi_j\!\left(\frac{x_i-y_i}{a}\right) \qquad (10)$$
is also a Mercer kernel provided that the $k_j(\mathbf{x},\mathbf{y})$ are Mercer kernels, for all $j=1,\ldots,g$. Hence, in order for a multiscale TI wavelet kernel to be a Mercer kernel, the single-scale kernel based on the given mother wavelet function must be a Mercer kernel. A family of TI wavelet Mercer kernels often used in machine learning is the family of Gaussian wavelet kernels, described as follows. Let $\psi(x)=(-1)^{p}C_{2p}(x)\exp(-x^{2}/2)$ be a Gaussian mother wavelet function, where $C_{2p}(x)\exp(-x^{2}/2)$ is the $2p$-th order derivative of the Gaussian function. Then the TI Mercer kernel using this Gaussian mother wavelet function is
$$k(\mathbf{x},\mathbf{y})=\prod_{i=1}^{d}(-1)^{p}\,C_{2p}\!\left(\frac{x_i-y_i}{a}\right)\exp\!\left(-\frac{(x_i-y_i)^{2}}{2a^{2}}\right). \qquad (11)$$
Different values of $p$ give different Gaussian mother wavelet functions. In particular, when $p=0$, $C_{2p}(x)=1$ and this Gaussian wavelet function reduces to the Gaussian function; when $p=1$, $C_{2p}(x)=x^{2}-1$ and this Gaussian wavelet function is the so-called Mexican hat mother wavelet [25].

The Morlet mother wavelet has recently been used in signal classification and compression [26]. We present a TI wavelet Mercer kernel based on the Morlet mother wavelet function because this mother wavelet has not previously been used as a kernel in either support vector regression or SVM. A proof that this kernel is a Mercer kernel, together with an investigation of its performance in KPCA, is needed before this type of wavelet kernel can be used.

Theorem 2. Let $\psi(x)=\cos(5x)\exp(-x^{2}/2)$ be the Morlet mother wavelet function. Then the TI kernel using this Morlet mother wavelet function,
$$k(\mathbf{x},\mathbf{y})=\prod_{i=1}^{d}\cos\!\left(\frac{5(x_i-y_i)}{a}\right)\exp\!\left(-\frac{(x_i-y_i)^{2}}{2a^{2}}\right), \qquad (12)$$
is a Mercer kernel.

The proof of this theorem is provided in the Appendix.

In general, a single-scale kernel, for example, the Gaussian kernel, is a smooth kernel and thus may not be able to capture some local behaviors of the data. Wavelet kernels are more flexible than other types of kernels, for example, polynomial kernels or the Gaussian kernel, which is why mother wavelet functions are adopted as kernels. Moreover, multiscale wavelet kernels combine multiple single-scale wavelet kernels at different scales. They are more flexible than single-scale wavelet kernels because both large and small scales enter the kernel function.
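The sketch below illustrates the TI kernels of this section: single-scale Morlet and Mexican hat TI kernels as in (12) and (11), and a multiscale TI wavelet kernel obtained by averaging single-scale kernels over a set of dilation factors in the spirit of (10). The dyadic scale grid anticipates Section 5, and the function names are ours.

```python
import numpy as np

def morlet_ti_kernel(x, y, a):
    """Single-scale Morlet translation invariant kernel, cf. (12)."""
    u = (np.asarray(x, float) - np.asarray(y, float)) / a
    return float(np.prod(np.cos(5.0 * u) * np.exp(-u ** 2 / 2.0)))

def mexican_hat_ti_kernel(x, y, a):
    """Single-scale Mexican hat translation invariant kernel, cf. (11) with p = 1."""
    u = (np.asarray(x, float) - np.asarray(y, float)) / a
    return float(np.prod((1.0 - u ** 2) * np.exp(-u ** 2 / 2.0)))

def multiscale_ti_kernel(x, y, single_scale=morlet_ti_kernel, scales=None):
    """Multiscale TI wavelet kernel: average of single-scale kernels, cf. (10)."""
    if scales is None:
        scales = 2.0 ** (0.25 * np.arange(6))   # a_j = 2^{0.25 j}, j = 0,...,5
    return float(np.mean([single_scale(x, y, a) for a in scales]))
```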

5. Computation of Multiscale Wavelet Kernels

In this section, we discuss the computational issues for the kernel matrix of a multiscale wavelet kernel that need to be addressed in KPCA. For a given data set $\{\mathbf{x}_1,\ldots,\mathbf{x}_n : \mathbf{x}_j\in\mathbb{R}^d\ \text{for all } j\}$, we first calculate the sample standard deviation of the data in coordinate $l$, denoted by $\sigma_l$, for $l=1,\ldots,d$. The data in coordinate $l$ are then divided by $\sigma_l$ to remove the potential effect of different scales of the observations. Before PCA is applied, a kernel matrix $\mathbf{K}$ obtained from either the dot-product type of kernel (7) or the translation invariant type of kernel is computed. In the computation of the kernel matrix of the MDWK described in (7), the values of $a_j$ and $b_k$ and their indexes $j$ and $k$ are selected in this paper as follows. The values of $a_j$ are powers of 2, that is, $a_j\in\{1,2^{0.25},\ldots,2^{0.25j},\ldots,2^{0.25(J-1)}\}$ for a given level $J$, which is 6 in this paper.

For each $a_j$, the sequence $b_k$ is selected as $b_k=k\,u_0\,a_j$, as suggested in [27]. Here, $u_0$ controls the resolution of $b_k$ and is set to 0.5. The range of $k$ is the set $\{0,1,\ldots,10\}$, which is determined by the effective support of the mother wavelet functions used in this paper. For the MTIWK, one does not need to specify the values of $b_k$, and the values of $a_j$ are chosen to be the same as those in the MDWK. The multiscale kernel functions are constructed via a semiparametric method because we do not calibrate the kernel parameters. Instead, we use the dilated and translated versions of a mother wavelet function, with the parameters taken as powers of 2. In this paper we use the following mother wavelet functions: the Gaussian mother wavelet function, the Morlet mother wavelet function, and the Mexican hat mother wavelet function.
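A sketch of the preprocessing and parameter grid described above follows (the helper names are ours; combining this grid with the MDWK sketch of Section 3 and the KPCA sketch of Section 2 yields the kernel matrix fed into PCA).

```python
import numpy as np

def standardize_columns(X):
    """Divide each coordinate l of the data by its sample standard deviation sigma_l."""
    X = np.asarray(X, float)
    return X / X.std(axis=0, ddof=1)

def mdwk_parameter_grid(J=6, u0=0.5, k_max=10):
    """Dilation/translation grid used for the MDWK in this section.

    a_j = 2^{0.25 j} for j = 0, ..., J-1, and b_k = k * u0 * a_j for k = 0, ..., k_max.
    Returns a list of (a_j, array of b_k) pairs.
    """
    dilations = 2.0 ** (0.25 * np.arange(J))
    return [(a_j, u0 * a_j * np.arange(k_max + 1)) for a_j in dilations]
```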

As noted earlier, in kernel-based methods it is important that the constructed kernel matrix is positive semidefinite. The kernel matrix $\mathbf{K}$ based on the SDWK defined in (9) is always positive semidefinite [3]. The MDWK defined in (7) is also a Mercer kernel, as a linear combination of Mercer kernels with nonnegative coefficients is a Mercer kernel [19]. A single-scale TI kernel is a Mercer kernel if it satisfies the Fourier condition [3], which implies that the kernel matrix is positive semidefinite. In order for a multiscale TI wavelet kernel to be a Mercer kernel, the single-scale TI kernel based on the given mother wavelet function must be a Mercer kernel. The Gaussian kernel and the Mexican hat kernel are Mercer kernels. Therefore, the multiscale TI Gaussian kernel and the multiscale TI Mexican hat kernel are Mercer kernels.

6. Simulation Experiments

The purpose of the simulation experiments is to explore the performance of our proposed method when applied to noisy multiscale data under different levels of the signal-to-noise ratio.

6.1. Simulation Design
6.1.1. Clustered Data

We consider two-class two-dimensional clustered data, denoted by $\mathcal{D}=\{\mathbf{x}_i,\mathbf{y}_i : i=1,\ldots,n\}$, where $\mathbf{x}_i=(x_{i,1},x_{i,2})$ represents the data of Cluster 1, $\mathbf{y}_i=(y_{i,1},y_{i,2})$ represents the data of Cluster 2, and $n$ is the total number of data points in each cluster. The simulation model is given by
$$x_{i,1}=x_{0,1}+\sigma_x^{r}e_i,\quad x_{i,2}=x_{0,2}+\sigma_x^{r}e_i,\quad y_{i,1}=y_{0,1}+\sigma_y^{s}e_i,\quad y_{i,2}=y_{0,2}+\sigma_y^{s}e_i, \qquad (13)$$
where $(x_{0,1},x_{0,2})$ and $(y_{0,1},y_{0,2})$ are the coordinates of the centers of Cluster 1 and Cluster 2, respectively, and $\sigma_x^{r}$ and $\sigma_y^{s}$ are the signal-to-noise ratios of each dimension of Cluster 1 and Cluster 2, respectively. The underlying added noises $e_i\sim N(0,1)$ are independent and identically distributed for both clusters.
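For concreteness, a sketch of the simulation model (13) is given below (we assume an independent standard normal noise term for each coordinate of each observation; the function name and the random number generator are our own choices).

```python
import numpy as np

def simulate_clusters(n, center1, center2, sigma_x, sigma_y, seed=None):
    """Simulate the two-class clustered data of (13).

    Each cluster contains n two-dimensional points: the cluster center plus
    independent N(0, 1) noise scaled by sigma_x (Cluster 1) or sigma_y (Cluster 2).
    Returns the stacked data matrix and the 0/1 class labels.
    """
    rng = np.random.default_rng(seed)
    cluster1 = np.asarray(center1, float) + sigma_x * rng.standard_normal((n, 2))
    cluster2 = np.asarray(center2, float) + sigma_y * rng.standard_normal((n, 2))
    X = np.vstack([cluster1, cluster2])
    labels = np.repeat([0, 1], n)
    return X, labels

# Example with the center and size values of Section 6.2.1:
# X, y = simulate_clusters(100, (0, 5), (4, 0), sigma_x=0.5, sigma_y=0.5, seed=1)
```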

6.2. Data Classification

In this section, we discuss how the WKPCA method performs in data classification. As we aim for linearly separable features, we apply a linear classifier, the Fisher linear discriminant (FLD), to our classification problems, to see whether the linearity of the data is improved after feature extraction. Feature extraction by PCA, by single-scale WKPCA with respect to different values of the kernel parameter $a$, and by multiscale WKPCA is considered. The Gaussian function, the Mexican hat mother wavelet function, and the Morlet mother wavelet function are used for constructing kernels. The set of values $a\in\{1,2^{0.25},2^{0.5},2^{0.75},2,2^{1.25}\}$ is selected for the single-scale WKPCA. The multiscale wavelet kernels are constructed using all the values of $a_j$ in the set $\{1,2^{0.25},2^{0.5},2^{0.75},2,2^{1.25}\}$. In order to evaluate the performance of the feature extraction methods, the average classification accuracy rate of the single-scale WKPCA, calculated over all values of the parameter $a$ used in the single-scale WKPCA, is compared to those of the multiscale WKPCA and the conventional PCA.
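The classification pipeline used in these experiments can be sketched as follows, relying on scikit-learn's KernelPCA with a precomputed wavelet kernel matrix (it performs the centering of (4) internally) and its LinearDiscriminantAnalysis as the FLD classifier; this is an illustrative reconstruction, not the code used to produce the reported figures.

```python
from sklearn.decomposition import KernelPCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def wkpca_fld_accuracy(K_train, K_test, y_train, y_test, n_components=20):
    """Feature extraction by (W)KPCA followed by the FLD classifier.

    K_train : (n, n) wavelet kernel matrix among the training points.
    K_test  : (m, n) wavelet kernel matrix between test and training points.
    Returns the classification accuracy on the test set.
    """
    kpca = KernelPCA(n_components=n_components, kernel="precomputed")
    Z_train = kpca.fit_transform(K_train)   # extracted features, training set
    Z_test = kpca.transform(K_test)         # extracted features, test set
    fld = LinearDiscriminantAnalysis()
    fld.fit(Z_train, y_train)
    return fld.score(Z_test, y_test)
```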

6.2.1. Homogeneous Clustered Data

The training data and the test data are simulated using the simulation model described in Section 6.1.1. The values of the model parameters for simulating both the training data sets and the test data sets are as follows: $x_{0,1}=0$, $x_{0,2}=5$, $y_{0,1}=4$, $y_{0,2}=0$, and $n=100$. In the case $\sigma_x^{r}=\sigma_y^{s}$, the simulated clustered data are homogeneous between the clusters. We consider 25 different values of $\sigma_x^{r}$. Each pair of $\sigma_x^{r}$ and $\sigma_y^{s}$ used for simulating the training data and the test data is denoted by $(\sigma_x^{r},\sigma_y^{s})$, for $r=1,2,\ldots,25$ and $s=1,2,\ldots,25$. The values of $\sigma_x^{r}$ and $\sigma_y^{s}$ are taken as $\sigma_x^{1}=\sigma_y^{1}=0.1$, $\sigma_x^{2}=\sigma_y^{2}=0.3$, $\sigma_x^{3}=\sigma_y^{3}=0.5,\ldots$, and $\sigma_x^{25}=\sigma_y^{25}=5$, respectively.

Figures 1(a) and 1(c) show the classification accuracy rates for the PCA method and for the WKPCA method with different choices of the type of wavelet kernel and with respect to the different values of $\sigma_x$. In Figure 1(a), feature extraction by the conventional PCA in the classification of the simulated homogeneous clustered data performs similarly to feature extraction by the WKPCA method. Although 20 extended features are used for classification, feature extraction by the WKPCA methods does not improve the classification accuracy rates in homogeneous clustered data classification. This result implies that the KPCA-based feature extraction method does not enhance the accuracy of data classification when a kernel-based feature extraction method plus a linear classifier is applied to linearly separable data. PCA and WKPCA perform similarly when the Mexican hat kernel is used. Figure 1(c) shows that feature extraction by PCA and by MTIWK PCA have similar performance in data classification; however, feature extraction by multiscale dot-product KPCA performs worse than either PCA or MTIWK PCA. The single-scale wavelet KPCA has the worst performance in data classification. With the increase of data variation, the MTIWK PCA behaves more robustly as a feature extraction method, showing the best performance among the considered methods.

6.2.2. Heterogeneous Clustered Data

From the discussion in Section 6.2.1, we notice that WKPCA as a feature extraction method does not outperform the conventional PCA method for homogeneous clustered data. For some wavelet kernels, for example, the SDWK or MDWK based on the Morlet mother wavelet function, WKPCA as the feature extraction method performs worse than the conventional PCA. This is because (1) the data in each coordinate of the clusters appear to be approximately single-scale, so the conventional PCA becomes an appropriate feature extraction method for this type of data; and (2) the homogeneous clustered data can be treated as linearly separable data with large data variation, so using a nonlinear method of feature extraction does not improve the performance of feature extraction. From our experiments, we observe that the performance of the WKPCA with the Mexican hat kernel is approximately equal to that of PCA.

In order to demonstrate the application of WKPCA as a feature extraction method in the classification of multiscale data, we simulate the training data and the test data using the simulation model described in Section 6.1.1. To simplify the problem, we fix the value of $\sigma_y^{s}$ to be 5 for $s=1,2,\ldots,30$ and take different values for $\sigma_x^{r}$. The remaining model parameters are the same as those of Section 6.2.1, except the values of $\sigma_x^{r}$, which are taken as $\sigma_x^{1}=0.1$, $\sigma_x^{2}=0.2,\ldots$, and $\sigma_x^{30}=3$, respectively.

In Figure 1(b), one can see that feature extraction by the conventional PCA has the worst performance for the heterogeneous clustered simulation data, while feature extraction by the multiscale WKPCA performs better for the same simulation data. Also, for the data sets with $\sigma_x$ larger than 1.5, the MTIWK in KPCA performs better than the MDWK. The average classification accuracy rates (in green) when the STI wavelet kernel with $a=1,2^{0.25},2^{0.5},2^{0.75},2,2^{1.25}$, respectively, is used in WKPCA are all lower than those when the multiscale wavelet kernel is used in WKPCA. In Figure 1(d), the MTIWK in KPCA is more robust as a feature extraction method than MDWK PCA and SDWK PCA. The conventional PCA is even better than WKPCA with the Morlet dot-product kernels, that is, the MDWK and SDWK.

6.2.3. Performance Evaluation Based on Monte Carlo Simulation

The multiscale WKPCA with TI kernels outperforms the conventional PCA, STIWK PCA, and MDWK PCA. However, the classification accuracy rates reported so far are based on only one training data set and one test data set for each pair $(\sigma_x,\sigma_y)$. In order to further evaluate the performance of WKPCA as the feature extraction method, we use the Monte Carlo simulation method to estimate the average classification accuracy rates and their sample standard deviations using the simulation model presented in Section 6.1.1.

The values of $\sigma_x$ are taken as $0.1, 0.3,\ldots,$ and $2.9$, with the other model parameters remaining the same as those in Section 6.2.2. For each simulation model setup with a different value of $\sigma_x$, the average classification accuracy rate and its sample standard deviation are computed for different types of kernel. The feature extraction method is chosen from the following: PCA; the multiscale WKPCA with either the Gaussian kernel or the Mexican hat kernel; and the single-scale WKPCA with either the Gaussian kernel or the Mexican hat kernel, with different values of the kernel parameter $a$ (i.e., $a=1,2^{0.25},2^{0.5},2^{0.75},2,2^{1.25}$). In the case of the multiscale WKPCA, both the dot-product kernel and the TI kernel are considered. Only the TI kernel is investigated for the single-scale WKPCA. For each simulation model setup, $m=100$ simulations are run, each with a different value of the random seed, to produce $m$ training data sets and $m$ test data sets.
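A sketch of this Monte Carlo evaluation, reusing the illustrative helpers `simulate_clusters` (Section 6.1.1) and `wkpca_fld_accuracy` (Section 6.2) together with a generic pairwise kernel evaluation, is given below; standardization of the coordinates (Section 5) is omitted for brevity.

```python
import numpy as np

def kernel_matrix(X, Y, kernel):
    """Evaluate kernel(x, y) for all pairs of rows of X and Y."""
    return np.array([[kernel(x, y) for y in Y] for x in X])

def monte_carlo_accuracy(kernel, sigma_x, sigma_y=5.0, n=100, m=100):
    """Average classification accuracy and its sample standard deviation over m replications."""
    accuracy = np.empty(m)
    for r in range(m):
        X_tr, y_tr = simulate_clusters(n, (0, 5), (4, 0), sigma_x, sigma_y, seed=r)
        X_te, y_te = simulate_clusters(n, (0, 5), (4, 0), sigma_x, sigma_y, seed=m + r)
        K_tr = kernel_matrix(X_tr, X_tr, kernel)
        K_te = kernel_matrix(X_te, X_tr, kernel)
        accuracy[r] = wkpca_fld_accuracy(K_tr, K_te, y_tr, y_te)
    return accuracy.mean(), accuracy.std(ddof=1)
```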

Besides the choice of kernel and the determination of kernel parameters, the feature dimension is also an important issue. The classification accuracy for a given data set may depend on the choice of feature dimension, which requires an investigation of how classification accuracy is related to the feature dimension. Estimates of the average classification accuracy rate and its sample standard deviation are obtained by applying the Monte Carlo method. The average classification accuracy rates for different numbers of retained features are reported in Figures 2(a)–2(f). The behavior of the sample standard deviation of the classification accuracy rate is presented in Figures 3(a)–3(f). Data classification using the FLD classifier alone and feature extraction by PCA plus the FLD classifier both have the worst performance for the simulated heterogeneous clustered data. Data classification using the multiscale WKPCA as the feature extraction method shows the best performance. The feature extraction method using the multiscale WKPCA is less affected by the data variances than the feature extraction methods based on PCA and the single-scale WKPCA.

7. Application to Epileptic EEG Signal Classification

In order to demonstrate how the proposed methods perform when applied to real data, we use a set of EEG signals recorded from healthy volunteers and from epilepsy patients. EEG signals are typically multiscale and nonstationary in nature. The database is from the University of Bonn, Germany (http://epileptologie-bonn.de/cms/front_content.php?idcat=193), and has five sets, denoted as sets A, B, C, D, and E. We use only sets A and E in our illustration. Set A consists of signals taken from five healthy volunteers who were relaxed and in the awake state with eyes open. The signals in set E were recorded from within the epileptogenic zone and contain only brain activity measured during seizure intervals. Each data group contains 100 single-channel EEG segments of 23.6-second duration, each sampled at 173.61 Hz.

The problem we consider is the classification of normal signals (i.e., set A) and epileptic signals (i.e., set E). Since we deal with extremely high dimensional data (i.e., $d=4097$), in order to make our classification task computationally efficient, we first extract the signal features by calculating the wavelet approximation coefficients of each signal in data sets A and E. The signals are normalized before the wavelet transform is applied using the Symlet 8 wavelet. We use high-level wavelet decompositions because of sparsity concerns and the goal of obtaining high signal discrimination power. Samples of the extracted features are shown in Figure 4. Note that the wavelet approximation coefficients around the two edges do not provide useful information for signal classification, as they are affected by the edges of the signals when the wavelet transform is applied. Only the wavelet approximation coefficients within the central portion are considered, as they are not affected by the signal boundaries and have higher discrimination power than those around the edges. For example, at decomposition level 10, we obtain a set of three-dimensional features as the input of kernel PCA. Such a low-dimensional feature set in the wavelet domain may not be sufficient to capture signal time variability. Therefore, we consider two additional cases, namely level 9 and level 8 wavelet decompositions. We select 7 and 11 features, corresponding to the wavelet approximation coefficients ranging from 8 to 14 and from 10 to 20 (within the central portion), for the level 9 and level 8 wavelet decompositions, respectively. We did not try levels smaller than 8, as they give a very high dimensional feature set and the selection of features becomes difficult at those levels.
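A sketch of this feature extraction step is shown below, using the PyWavelets package for the Symlet 8 decomposition; the z-score normalization and the function name are our own illustrative choices, and the `coef_range` argument encodes the 1-based positions of the central approximation coefficients quoted above (e.g., (8, 14) at level 9 and (10, 20) at level 8).

```python
import numpy as np
import pywt

def eeg_wavelet_features(signal, level, coef_range):
    """Central wavelet approximation coefficients of a normalized EEG segment.

    signal     : 1-D array with the raw EEG samples (here d = 4097).
    level      : wavelet decomposition level (8, 9, or 10 in our experiments).
    coef_range : (first, last) 1-based positions of the approximation
                 coefficients to keep, away from the signal boundaries.
    """
    signal = np.asarray(signal, float)
    normalized = (signal - signal.mean()) / signal.std()   # assumed normalization
    coeffs = pywt.wavedec(normalized, "sym8", level=level)
    approx = coeffs[0]                                      # approximation coefficients
    first, last = coef_range
    return approx[first - 1:last]

# Example: 7 features at level 9, 11 features at level 8
# features9 = eeg_wavelet_features(segment, level=9, coef_range=(8, 14))
# features8 = eeg_wavelet_features(segment, level=8, coef_range=(10, 20))
```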

As the input signals are normalized before the wavelet analysis, we eliminate the differences in signal energy between the groups. The high variability of the extracted features reflects a high signal variability in the original time domain. As we can see, the extracted features of the normal signals fluctuate more than those of the epileptic ones. This coincides with the clinical findings about the rhythms of epileptic signals, which fluctuate more regularly, that is, tend to be more deterministic. For all three cases considered, the WKPCA coupled with different kernels is applied to the wavelet approximation coefficients of the signals, and up to 20 principal components are extracted from WKPCA. The obtained classification accuracies, using different types of wavelet kernel and simple classifiers, are reported in Figure 5. As a high level of wavelet decomposition can only capture a very coarse version of the signal, level 10, which gives only three-dimensional features, is not enough to capture the signal time variability among the groups. Although the signal features are extended in PC space, it is important to retain the discriminative features from the original signals. Our study suggests that a level slightly smaller than the maximum allowed level is necessary to balance the trade-off between classification performance and the sparsity of the input feature vector. The results shown in Figure 5 also suggest that classification performance for this data set does not obviously depend on the choice of kernel. Among all three cases considered, the best performance is obtained by using TI WKPCA with the FLD classifier, which confirms our findings on the improvement of linearity of features using multiscale wavelet kernels. Thus, a non-linear classifier such as 1-NN does not necessarily outperform a linear classifier like FLD when WKPCA is used as the feature extraction method, because the linearity is improved by WKPCA, while the 1-NN classifier performs better for clustered features than for linear features. The classification accuracy is less affected by the feature dimension in PC space when the FLD classifier is used. However, TI WKPCA with the 1-NN classifier achieves a higher accuracy when low-dimensional features are used for classification. This suggests that it is beneficial to use a classification scheme based on multiscale wavelet KPCA plus a simple classifier, either linear or non-linear. The considered example demonstrates the applicability of the proposed method to multiscale data, and the proposed method could serve as an alternative approach to non-linear signal classification problems.

8. Conclusion and Discussion

This paper introduced wavelet kernel PCA in order to better capture data similarity measures in the kernel matrix. Multiscale wavelet kernels were constructed from a given mother wavelet function to improve the performance of KPCA as the feature extraction method in multiscale data classification. Based on the analysis of the simulation data sets and the real data, we observed that the multiscale translation invariant wavelet kernel in KPCA has enhanced performance in feature extraction. The multiscale method for constructing a wavelet kernel in KPCA improves the robustness of KPCA in data classification, as it tends to smooth out the locally modulated behavior caused by some types of mother wavelet. The application to real data was demonstrated through an EEG classification problem, and the obtained results show the improvement of linearity after applying the multiscale WKPCA. Therefore, a simple linear classifier becomes suitable for classifying the extracted features. This work focused on two important aspects: the first was the construction of Mercer-type wavelet kernels for kernel PCA, and the second was the investigation of the applicability of the proposed method.

The multiscale wavelet kernels proposed for use in KPCA may also be useful for other kernel-based methods in pattern recognition, such as support vector regression, kernel discriminant analysis, kernel density estimation, or curve fitting. Many kernel-based statistical methods require the optimization of kernel parameters, which is usually computationally expensive for high dimensional data. Because of this, the use of multiscale kernels whose parameters must be calibrated is impractical, as the computational cost increases dramatically with the number of kernel parameters of a multiscale kernel. In contrast, the proposed multiscale wavelet kernels make it possible to narrow down the search for the values of the kernel parameters, because a linear combination of multiple kernel functions constructed from a mother wavelet function is considered in this approach, with the aim of capturing the multiscale components of the data. However, since the multiscale wavelet kernels are nonparametric, the performance of kernel-based methods using the multiscale wavelet kernels may not lead to an optimal solution to the problem.

Appendix

A. Proof of Theorem 1

Let $\mathbf{x}_1,\ldots,\mathbf{x}_n\in\mathbb{R}^d$ and $r_1,\ldots,r_n\in\mathbb{R}$. It is sufficient to prove that the kernel matrix $\mathbf{K}$ is positive semidefinite. Writing $x_{p,i}$ for the $i$-th coordinate of $\mathbf{x}_p$, and using the fact that $a_j>0$, we have
$$\sum_{p=1}^{n}\sum_{q=1}^{n}r_p r_q\,k(\mathbf{x}_p,\mathbf{x}_q)=\sum_{p=1}^{n}\sum_{q=1}^{n}r_p r_q\sum_{i=1}^{d}\sum_{j=0}^{J-1}\sum_{k=0}^{N}\frac{1}{a_j}\,\psi\!\left(\frac{x_{p,i}-b_k}{a_j}\right)\psi\!\left(\frac{x_{q,i}-b_k}{a_j}\right)=\sum_{i=1}^{d}\sum_{j=0}^{J-1}\sum_{k=0}^{N}\frac{1}{a_j}\left(\sum_{p=1}^{n}r_p\,\psi\!\left(\frac{x_{p,i}-b_k}{a_j}\right)\right)^{2}\ge 0. \qquad (A.1)$$
Therefore, the kernel defined in (7) is a Mercer kernel. This completes the proof.

B. Proof of Theorem 2

Proof. By the Fourier condition theorem in [3], it is sufficient to prove that
$$\hat{k}(\mathbf{w})=(2\pi)^{-d/2}\int_{\mathbb{R}^d}\exp(-j\,\mathbf{w}^{\top}\mathbf{x})\,k(\mathbf{x})\,d\mathbf{x}\ge 0 \qquad (B.1)$$
for all $\mathbf{w}$, where
$$k(\mathbf{x})=\prod_{i=1}^{d}\cos\!\left(\frac{5x_i}{a}\right)\exp\!\left(-\frac{x_i^{2}}{2a^{2}}\right). \qquad (B.2)$$
Before we prove this fact, we first introduce the complex Morlet wavelet transform of a given signal $s(t)$, generally written as follows [28]:
$$\tilde{s}(a,\tau)=\int s(t)\,\frac{1}{\sqrt{a}}\exp\!\left(-jk_0\frac{t-\tau}{a}-\frac{(t-\tau)^{2}}{2a^{2}}\right)dt, \qquad (B.3)$$
where $\frac{1}{\sqrt{a}}\exp\!\left(-jk_0\frac{t-\tau}{a}-\frac{(t-\tau)^{2}}{2a^{2}}\right)$ is the dilated and translated version of the complex Morlet mother wavelet function $\exp(-jk_0 t)\exp(-t^{2}/2)$, and $\tau,a\in\mathbb{R}^{+}$ are the translation and dilation factors, respectively. Taking $s(t)=\cos(w_n t)$ and setting $a=k_0/w$, (B.3) becomes
$$\begin{aligned}\tilde{s}(a,\tau)&=\int\cos(w_n t)\exp\!\left(-jw(t-\tau)-\frac{(t-\tau)^{2}}{2a^{2}}\right)\sqrt{\frac{w}{k_0}}\,dt\\ &=\sqrt{\frac{\pi k_0}{2w}}\left\{\cos(w_n\tau)\left[\exp\!\left(-\frac{(w_n-w)^{2}k_0^{2}}{2w^{2}}\right)+\exp\!\left(-\frac{(w_n+w)^{2}k_0^{2}}{2w^{2}}\right)\right]\right.\\ &\qquad\left.+\,j\sin(w_n\tau)\left[\exp\!\left(-\frac{(w_n-w)^{2}k_0^{2}}{2w^{2}}\right)-\exp\!\left(-\frac{(w_n+w)^{2}k_0^{2}}{2w^{2}}\right)\right]\right\}. \end{aligned} \qquad (B.4)$$
If we set $\tau=0$, $w_n=5/a$, and $t=x_i$ in (B.4), then we have
$$\tilde{s}(a,0)=\frac{1}{\sqrt{a}}\int\cos\!\left(\frac{5x_i}{a}\right)\exp(-jwx_i)\exp\!\left(-\frac{x_i^{2}}{2a^{2}}\right)dx_i=\sqrt{\frac{\pi a}{2}}\left[\exp\!\left(-\frac{(5/a-w)^{2}a^{2}}{2}\right)+\exp\!\left(-\frac{(5/a+w)^{2}a^{2}}{2}\right)\right]. \qquad (B.5)$$
Observe that by substituting (B.2) into (B.1) we get
$$\hat{k}(\mathbf{w})=(2\pi)^{-d/2}\prod_{i=1}^{d}\int\exp\!\left(-jw_i x_i-\frac{x_i^{2}}{2a^{2}}\right)\cos\!\left(\frac{5x_i}{a}\right)dx_i. \qquad (B.6)$$
Using the results in (B.4) and (B.5), $\hat{k}(\mathbf{w})$ becomes
$$\hat{k}(\mathbf{w})=2^{-d/2}\prod_{i=1}^{d}a\left[\exp\!\left(-\frac{(5/a-w_i)^{2}a^{2}}{2}\right)+\exp\!\left(-\frac{(5/a+w_i)^{2}a^{2}}{2}\right)\right]\ge 0. \qquad (B.7)$$
Therefore, the TI wavelet kernel constructed using the Morlet mother wavelet function is a Mercer kernel. This completes the proof.

Acknowledgments

S. Xie acknowledges the financial support from MITACS and Ryerson University, under MITACS Elevate Strategic Post-doctoral Award. P. Lio is supported by RECOGNITION: Relevance and Cognition for Self-Awareness in a Content-Centric Internet (257756), which is funded by the European Commission within the 7th Framework Programme (FP7).