Abstract
As a multichannel spatial filtering technique, common spatial patterns (CSP) has been successfully applied in the electroencephalogram (EEG)-based brain-computer interface (BCI) community. However, it is sensitive to outliers because of the employment of the L2-norm in its formulation. It is therefore beneficial to perform robust modelling for CSP. In this paper, we propose a robust framework, called CSP-Lp/q, by formulating the variances of the two EEG classes with the Lp- and Lq-norms ($0 < p, q \le 2$) separately. By mixing the Lp- and Lq-norms, CSP-Lp/q takes the class-wise difference into account when formulating the sample dispersion. We develop an iterative algorithm to optimize the objective function of CSP-Lp/q and prove its monotonicity theoretically. The superiority of the proposed CSP-Lp/q technique is experimentally demonstrated on three real EEG datasets from BCI competitions.
1. Introduction
The communication channel between brain and machine is widely studied in the brain-computer interface (BCI) community based on electroencephalogram (EEG) [1]. It has potentially broad applications in the domains of biomedical engineering, rehabilitation, education, and entertainment [2]. The central issue in BCI is to accurately decode brain activities, which involves extracting discriminative features from the brain signals [3].
EEG signals are multichannel time series with high temporal resolution. A large number of machine learning algorithms have been used to extract discriminative EEG features, where the samples are the EEG signals collected at different time points [4, 5]. For the brain activity of motor imagery (MI), there exists a phenomenon called event-related synchronization/desynchronization (ERS/ERD) [6]. That is, the brain energy in the corresponding region of the cerebral cortex increases in the α band (8–13 Hz) and the β band (13–30 Hz), while the rhythmic energy in the contralateral region significantly decreases [7, 8]. Based on the ERS/ERD principle, the technique of common spatial patterns (CSP) carries out spatial filtering so as to contrast the difference between two EEG classes [9–11]. By maximizing or minimizing the ratio of the spatially filtered variances of the two EEG classes, much of the literature addresses the application or development of CSP, for example, local temporal CSP (LTCSP) [12], comprehensive CSP (cCSP) [13], and spatial-temporal filters [14]. Some methods have been developed based on sparse learning for EEG analysis, for example, the optimization of spatial patterns using multiview learning within CSP [15], prediction of antidepressant response based on an EEG signature [16], the sparse Bayesian extreme learning machine for EEG classification [17], and the sparse group representation model [18]. Other methods address the problem of channel selection, say, based on correlation [19] and the bispectrum [20]. Recently, a novel method based on CSP (called CSP-LBP), which uses logarithmic band power (LBP) instead of variance to form the objective function, was developed to extract features [21]. CSP-LBP redistributes the features so that they can be easily classified. The frequency domain CSP (FDCSP) employs samples in the frequency domain as input signals of CSP [22].
The transformed data reduce the signal complexity, and some statistical features such as mean, median, maximum, and skewness are extracted. The classification results by the artificial neural network verify the effectiveness of FDCSP.
In the case of clean data, the classical CSP-based methods exhibit satisfactory performance in discriminative feature extraction. However, EEG data are usually polluted by electromyogram (EMG), electrooculogram (EOG), and spikes. It is, therefore, important to resort to a robust avenue. We previously developed the L1-norm-based CSP (CSP-L1) [23] for the purpose of robust modelling, in which the sample dispersion is formulated with the L1-norm. The L1-norm has also been used for embedded feature selection [24]. In contrast with the L2-norm, the L1-norm is robust from the perspective of statistical theory [25]. The L2-norm involves a squaring operation, which enlarges the negative effect of outliers associated with large deviations. In order to balance the large deviations, the resulting projection is likely to be distorted. The L1-norm, however, uses the absolute value operation, which reduces the bad impact of outliers since the dominance of the large deviations is weakened. The L1-norm was then generalized to the Lp-norm scenario for CSP (CSP-Lp) [26] as well as in machine learning [27]. More importantly, the two classes of EEG signals are collected under two different mental conditions, and different mental activities may reflect different intrinsic properties of the EEG signals. Specifically, for the motor imageries of the left and right hands, the brain demonstrates a contralateral effect. The EEG signals recorded under the two mental conditions are related to different cerebral hemispheres and may express different neurophysiological information. It may therefore be unnecessary to use the same norm to measure the band powers (i.e., dispersions of samples) of the two classes of EEG signals; more discriminative features can be expected by modelling each class of EEG signals specifically. If two different norms are used to measure the dispersions of the two classes of samples, there may still be room to improve the capacity of spatial filtering.
In this paper, we further generalize the filtering performance of CSP-L1 by modelling each class of EEG signals individually. Specifically, we define the dispersions of the two classes of samples in the objective function by using the Lp- and Lq-norms ($0 < p, q \le 2$). We term the proposed approach CSP-Lp/q. It is noted that, in the field of machine learning, mixed norms have recently shown great advantage in classification [28]. The CSP-Lp/q method seeks the optimal projection vectors by optimizing the ratio of the Lp-norm dispersion of one class to the Lq-norm dispersion of the other class. The features are then extracted using the mixed Lp/q-norm dispersions. With the features, classification is implemented via Fisher linear discriminant analysis (LDA) [29]. That is, the feature vectors produced on the training trials are regarded as input samples, on which the objective function of the Fisher criterion is evaluated. According to the principle of maximizing the ratio of between-class to within-class scatter, LDA seeks optimal projection vectors. Each testing trial is then assigned to the class whose LDA-projected training features are nearest to it. We point out that the proposed CSP-Lp/q approach has the following three advantages. (1) CSP-Lp/q incorporates robust modelling by defining the dispersions of the two classes of samples with different robustness-associated norms. (2) CSP-Lp/q is a generalization of CSP, CSP-L1, and CSP-Lp. Mathematically, taking $p = q = 2$ or $p = q = 1$ recovers the conventional CSP and CSP-L1 methods, respectively. Likewise, CSP-Lp is a special case of CSP-Lp/q obtained by setting $p = q$. (3) An iterative procedure is given to optimize the objective function of CSP-Lp/q.
We organize the rest of this paper as follows. We briefly review the classical CSP and CSP-L1 methods in Section 2. In Section 3, the CSP-Lp/q method, as well as its iterative procedure, is presented. The extension to multiple spatial filters and the feature extraction are also outlined in this section. Section 4 illustrates the experimental results on three real EEG datasets. Then, the paper is concluded in Section 5.
2. Conventional CSP and Its L1-Norm-Based Version
The basic idea of the classical CSP is to contrast the difference between two EEG classes by spatial filtering. Assume that $X_i^{(1)} \in \mathbb{R}^{c \times n}$ ($i = 1, \ldots, n_1$) and $X_j^{(2)} \in \mathbb{R}^{c \times n}$ ($j = 1, \ldots, n_2$) are the EEG trials of the two classes, where $c$ represents the number of sensors adopted in the experiment, $n$ is the number of samples recorded within each epoch, and $n_1$ and $n_2$ are the numbers of trials of the two classes, respectively. We assume that the EEG signals are already bandpass filtered, centered, and scaled.
Denote the whole trials of the two classes by $X_1 = [X_1^{(1)}, \ldots, X_{n_1}^{(1)}]$ and $X_2 = [X_1^{(2)}, \ldots, X_{n_2}^{(2)}]$, and relabel the columns (i.e., normalized samples) of $X_1$ and $X_2$ as $\mathbf{x}_1, \ldots, \mathbf{x}_{N_1}$ and $\mathbf{y}_1, \ldots, \mathbf{y}_{N_2}$, respectively. Note that $N_1 = n n_1$ and $N_2 = n n_2$. Then, the covariance matrices of the two classes are simply formulated as $C_1 = X_1 X_1^{\top}$ and $C_2 = X_2 X_2^{\top}$, where $\top$ is the transpose operator.
Mathematically, the problem of seeking the spatial filter $\mathbf{w} \in \mathbb{R}^{c}$ with CSP boils down to the optimization of the objective function given by

$$J(\mathbf{w}) = \frac{\|\mathbf{w}^{\top} X_1\|_2^2}{\|\mathbf{w}^{\top} X_2\|_2^2} = \frac{\mathbf{w}^{\top} C_1 \mathbf{w}}{\mathbf{w}^{\top} C_2 \mathbf{w}}, \quad (1)$$

where $\|\cdot\|_2$ denotes the L2-norm. It turns out to be a generalized eigenvalue decomposition problem: $C_1 \mathbf{w} = \lambda C_2 \mathbf{w}$. The few leading eigenvectors are associated with the largest or smallest eigenvalues, onto which the projected variances could produce discriminative features.
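For the L2-norm case, the filters admit a closed-form solution. The sketch below (the helper name `csp_filters` is illustrative) solves the equivalent symmetric generalized eigenproblem $C_1 \mathbf{w} = \lambda (C_1 + C_2)\mathbf{w}$, a standard numerically stable reformulation:

```python
import numpy as np
from scipy.linalg import eigh

def csp_filters(X1, X2, n_pairs=3):
    """Classical CSP sketch: solve C1 w = lambda (C1 + C2) w and keep the
    eigenvectors with the smallest and largest eigenvalues.
    X1, X2: (channels, samples) matrices of concatenated, centered trials."""
    C1 = X1 @ X1.T / X1.shape[1]
    C2 = X2 @ X2.T / X2.shape[1]
    # Generalized symmetric eigenproblem; eigh returns ascending eigenvalues.
    vals, vecs = eigh(C1, C1 + C2)
    # First n_pairs columns emphasize class-2 variance, last n_pairs class-1.
    return np.hstack([vecs[:, :n_pairs], vecs[:, -n_pairs:]])
```

The ratio $\mathbf{w}^{\top}C_1\mathbf{w}/\mathbf{w}^{\top}(C_1+C_2)\mathbf{w}$ is a monotone function of (1), so the extreme eigenvectors of either formulation coincide.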
While CSP defines its objective function based on the L2-norm, the CSP-L1 approach, by contrast, utilizes the L1-norm to define the objective function given by

$$J_1(\mathbf{w}) = \frac{\|\mathbf{w}^{\top} X_1\|_1}{\|\mathbf{w}^{\top} X_2\|_1} = \frac{\sum_{i=1}^{N_1} |\mathbf{w}^{\top} \mathbf{x}_i|}{\sum_{j=1}^{N_2} |\mathbf{w}^{\top} \mathbf{y}_j|}, \quad (2)$$

where $\|\cdot\|_1$ denotes the L1-norm, i.e., the sum of the absolute values of the entries. Due to the nondifferentiability of the absolute value function at zero, the optimization of CSP-L1 is usually implemented by an iterative algorithm.
3. CSP with Mixed Lp- and Lq-Norm Optimization
In the presence of outliers, CSP-L1 exhibits an advantage due to the robust modelling property of the L1-norm. Further, it is good practice to employ the Lp-norm for robust modelling since the L1-norm may not be the optimal norm selection and its generalized Lp-norm is more flexible for accommodating varying EEG signals [26]. Because EEG signals are collected under different mental conditions and may embody different intrinsic expressions of neurophysiological information, it is beneficial to model each class of EEG signals specifically. Based on this neurophysiological principle, we propose using the Lp-norm to measure one class and the Lq-norm to measure the other in defining the objective function as follows:

$$F(\mathbf{w}) = \frac{\|\mathbf{w}^{\top} X_1\|_p^p}{\|\mathbf{w}^{\top} X_2\|_q^q} = \frac{\sum_{i=1}^{N_1} |\mathbf{w}^{\top} \mathbf{x}_i|^p}{\sum_{j=1}^{N_2} |\mathbf{w}^{\top} \mathbf{y}_j|^q}, \quad 0 < p, q \le 2, \quad (3)$$

where $\|\cdot\|_p$ and $\|\cdot\|_q$ denote the Lp-norm and the Lq-norm, respectively. The Lp-norm of a vector $\mathbf{v}$ is defined as $\|\mathbf{v}\|_p = (\sum_i |v_i|^p)^{1/p}$. We refer to (3) as mixed Lp- and Lq-norm-based CSP and denote it by CSP-Lp/q. Clearly, in the case of $p = q = 1$, CSP-Lp/q reduces to CSP-L1, so CSP-L1 is a special case of CSP-Lp/q. Since the optimization (both maximization and minimization) of CSP-Lp/q is not straightforward due to the mixed norms, in the following, we design an iterative algorithm under the framework of bound optimization to calculate the filter bases.
3.1. Algorithmic Implementation
The maximization of the objective function of CSP-Lp/q proceeds as follows.

(1) Filter initialization: let $t$ be the iteration index and set $t = 0$, and take any $c$-dimensional vector with unit length as the initial spatial filter $\mathbf{w}(0)$.

(2) Polarity definition: define two polarity functions

$$s_i(t) = \operatorname{sgn}(\mathbf{w}(t)^{\top} \mathbf{x}_i), \quad i = 1, \ldots, N_1, \quad (4)$$

$$r_j(t) = \operatorname{sgn}(\mathbf{w}(t)^{\top} \mathbf{y}_j), \quad j = 1, \ldots, N_2, \quad (5)$$

where $\operatorname{sgn}(\cdot)$ is the sign function.

(3) Gradient computation: compute the gradient vector as

$$\mathbf{d}(t) = \frac{\sum_{i=1}^{N_1} p\, s_i(t)\, |\mathbf{w}(t)^{\top} \mathbf{x}_i|^{p-1}\, \mathbf{x}_i}{\|\mathbf{w}(t)^{\top} X_1\|_p^p} - \frac{\sum_{j=1}^{N_2} q\, r_j(t)\, |\mathbf{w}(t)^{\top} \mathbf{y}_j|^{q-1}\, \mathbf{y}_j}{\|\mathbf{w}(t)^{\top} X_2\|_q^q}. \quad (6)$$

Here, in case $\mathbf{w}(t)^{\top} \mathbf{x}_i = 0$ or $\mathbf{w}(t)^{\top} \mathbf{y}_j = 0$, we add a small perturbation to $\mathbf{w}(t)$.

(4) Filter update: update $\mathbf{w}$ by

$$\mathbf{w}(t+1) = \frac{\mathbf{w}(t) + \eta\, \mathbf{d}(t)}{\|\mathbf{w}(t) + \eta\, \mathbf{d}(t)\|_2}, \quad (7)$$

where $\eta$ is a learning rate parameter taking a small positive value. Since only the direction of the basis vector is of interest, $\mathbf{w}(t+1)$ is normalized to unit length.

(5) Convergence inspection: if $\mathbf{w}(t+1)$ does not substantially augment the value of the objective function, i.e., $F(\mathbf{w}(t+1)) - F(\mathbf{w}(t))$ is less than a threshold $\delta$, then break the run and set $\mathbf{w}^{*} = \mathbf{w}(t)$. Otherwise, set $t \leftarrow t + 1$ and return to Step 2.

(6) Filter determination: output the spatial filter $\mathbf{w}^{*}$.
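The six steps above can be sketched as follows. The function name, default learning rate, and perturbation size are illustrative assumptions; the gradient implemented is the log-ratio ascent direction of the objective, matching Step 3:

```python
import numpy as np

def csp_lpq_filter(X1, X2, p=0.8, q=1.2, eta=1e-4, delta=1e-5,
                   max_iter=500, seed=0):
    """Sketch of one spatial filter maximizing ||w'X1||_p^p / ||w'X2||_q^q
    by gradient ascent with unit-norm renormalization at every step."""
    rng = np.random.default_rng(seed)
    c = X1.shape[0]
    w = rng.standard_normal(c)
    w /= np.linalg.norm(w)                      # Step 1: unit-length init

    def objective(w):
        return np.sum(np.abs(w @ X1) ** p) / np.sum(np.abs(w @ X2) ** q)

    f_old = objective(w)
    for _ in range(max_iter):
        u, v = w @ X1, w @ X2
        # Perturb w slightly if any projection hits exactly zero.
        if np.any(u == 0) or np.any(v == 0):
            w = w + 1e-8 * rng.standard_normal(c)
            w /= np.linalg.norm(w)
            u, v = w @ X1, w @ X2
        # Steps 2-3: polarities enter via np.sign; log-ratio gradient.
        g1 = X1 @ (p * np.sign(u) * np.abs(u) ** (p - 1)) / np.sum(np.abs(u) ** p)
        g2 = X2 @ (q * np.sign(v) * np.abs(v) ** (q - 1)) / np.sum(np.abs(v) ** q)
        # Step 4: ascend and renormalize to unit length.
        w_new = w + eta * (g1 - g2)
        w_new /= np.linalg.norm(w_new)
        # Step 5: stop when the improvement falls below the threshold.
        f_new = objective(w_new)
        if f_new - f_old < delta:
            break
        w, f_old = w_new, f_new
    return w
```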
The algorithmic complexity is $O(c(N_1 + N_2))$ per iteration, dominated by the projections and the gradient sums. In practice, the total number of iterations and the specific EEG signals determine the final computational cost.
3.2. Algorithmic Justification
The algorithmic procedure provided above is justified by the following theorem.
Theorem 1. With the iterative procedure for CSP-Lp/q, the objective function is monotonically increased at each iteration, i.e., $F(\mathbf{w}(t+1)) \ge F(\mathbf{w}(t))$.
We establish the theorem from the perspective of bound optimization [30, 31], a very powerful and general optimization framework. Under bound optimization, we need to coin a surrogate function $G(\mathbf{w}, \mathbf{w}(t))$ which satisfies the requirement that the value of $F(\mathbf{w}) - G(\mathbf{w}, \mathbf{w}(t))$ is smallest (namely, zero) at the point $\mathbf{w} = \mathbf{w}(t)$. Then, for the surrogate function, the next basis $\mathbf{w}(t+1)$ is updated such that

$$G(\mathbf{w}(t+1), \mathbf{w}(t)) \ge G(\mathbf{w}(t), \mathbf{w}(t)). \quad (8)$$
Therefore, bound optimization is an iterative procedure whose convergence can be theoretically confirmed. Specifically, the monotonicity of the objective function follows from the chain [30]:

$$F(\mathbf{w}(t+1)) \ge G(\mathbf{w}(t+1), \mathbf{w}(t)) \ge G(\mathbf{w}(t), \mathbf{w}(t)) = F(\mathbf{w}(t)). \quad (9)$$
To achieve a surrogate function for maximizing $F(\mathbf{w})$, the following two lemmas are needed.
Lemma 1 (see [32]). Letting $p \ge 1$, for two vectors $\mathbf{u}$ and $\mathbf{v}$, one has the inequality

$$\|\mathbf{u}\|_p^p \ge p\, (\operatorname{sgn}(\mathbf{v}) \odot |\mathbf{v}|^{p-1})^{\top} \mathbf{u} + (1 - p) \|\mathbf{v}\|_p^p, \quad (10)$$

and the equality holds when taking $\mathbf{u} = \mathbf{v}$, where the operations $|\cdot|^{p-1}$ and $\operatorname{sgn}(\cdot)$ are applied to the elements of a vector and $\odot$ performs the elementwise product between two vectors.
Lemma 2 (see [33]). Letting $0 < q \le 2$, for a vector $\mathbf{u}$ and a vector $\mathbf{v}$ with nonzero elements, one has the variational inequality

$$\|\mathbf{u}\|_q^q \le \frac{q}{2}\, \mathbf{u}^{\top} D(\mathbf{v})\, \mathbf{u} + \left(1 - \frac{q}{2}\right) \|\mathbf{v}\|_q^q. \quad (11)$$

If $\mathbf{u} = \mathbf{v}$, the equality in the above expression is established, where $D(\mathbf{v})$ is the diagonal matrix with $|v_j|^{q-2}$ being its diagonal elements.
We coin a surrogate function as

$$G(\mathbf{w}, \mathbf{w}(t)) = \frac{p\, (\operatorname{sgn}(\mathbf{v}_1) \odot |\mathbf{v}_1|^{p-1})^{\top} X_1^{\top} \mathbf{w} + \alpha}{\frac{q}{2}\, \mathbf{w}^{\top} X_2 D(\mathbf{v}_2) X_2^{\top} \mathbf{w} + \beta}, \quad (12)$$

where $\mathbf{v}_1 = X_1^{\top} \mathbf{w}(t)$, $\mathbf{v}_2 = X_2^{\top} \mathbf{w}(t)$, $\alpha = (1 - p)\|\mathbf{v}_1\|_p^p$, and $\beta = (1 - q/2)\|\mathbf{v}_2\|_q^q$. The function $G$ can serve as a valid surrogate function due to the fact that $G(\mathbf{w}, \mathbf{w}(t)) \le F(\mathbf{w})$ and the equality holds when taking $\mathbf{w} = \mathbf{w}(t)$. The fact is demonstrated as follows.
We compare the numerator and the denominator of $G$ with those of $F$, respectively. Note that the two constants $\alpha$ and $\beta$ do not depend on $\mathbf{w}$ and thus act as constants in the optimization. Firstly, we consider the numerator. By letting $\mathbf{u} = X_1^{\top} \mathbf{w}$ and $\mathbf{v} = \mathbf{v}_1$ and substituting them into the inequality of Lemma 1, we have that

$$\|\mathbf{w}^{\top} X_1\|_p^p \ge p\, (\operatorname{sgn}(\mathbf{v}_1) \odot |\mathbf{v}_1|^{p-1})^{\top} X_1^{\top} \mathbf{w} + \alpha. \quad (13)$$
Clearly, when taking $\mathbf{u} = \mathbf{v}_1$, i.e., $\mathbf{w} = \mathbf{w}(t)$, the equality in (13) appears. Secondly, we go on to consider the denominator. By letting $\mathbf{u} = X_2^{\top} \mathbf{w}$ and $\mathbf{v} = \mathbf{v}_2$ and substituting them into the inequality of Lemma 2, it follows that

$$\|\mathbf{w}^{\top} X_2\|_q^q \le \frac{q}{2}\, \mathbf{w}^{\top} X_2 D(\mathbf{v}_2) X_2^{\top} \mathbf{w} + \beta. \quad (14)$$
Again, the equality in (14) arises when taking $\mathbf{u} = \mathbf{v}_2$, i.e., $\mathbf{w} = \mathbf{w}(t)$. Combining (13) and (14), we have that

$$G(\mathbf{w}, \mathbf{w}(t)) \le F(\mathbf{w}), \quad (15)$$

and $G(\mathbf{w}(t), \mathbf{w}(t)) = F(\mathbf{w}(t))$ when taking $\mathbf{w} = \mathbf{w}(t)$. This shows that the function $G$ given in (12) is a valid surrogate function.
As guaranteed by the principle of bound optimization, to maximize $F$, it suffices to iteratively augment the surrogate function at each iteration $t$. Differentiating $G(\mathbf{w}, \mathbf{w}(t))$ with respect to $\mathbf{w}$, we obtain the gradient value at the point $\mathbf{w} = \mathbf{w}(t)$ given by

$$\nabla G\big|_{\mathbf{w} = \mathbf{w}(t)} = \frac{p\, X_1 (\operatorname{sgn}(\mathbf{v}_1) \odot |\mathbf{v}_1|^{p-1})}{\|\mathbf{v}_2\|_q^q} - \frac{\|\mathbf{v}_1\|_p^p}{\|\mathbf{v}_2\|_q^{2q}}\, q\, X_2 (\operatorname{sgn}(\mathbf{v}_2) \odot |\mathbf{v}_2|^{q-1}) = F(\mathbf{w}(t))\, \mathbf{d}(t), \quad (16)$$

which is a positive multiple of the vector $\mathbf{d}(t)$ given in (6). That is, $\mathbf{d}(t)$ is a gradient-ascending direction of $G$ at the point $\mathbf{w}(t)$. Consequently, by calculating the basis according to (7) with a sufficiently small $\eta$, we have $G(\mathbf{w}(t+1), \mathbf{w}(t)) \ge G(\mathbf{w}(t), \mathbf{w}(t))$, which implies that update step (7) satisfies condition (8).
Note that, in the above derivation, we in fact require that $p \ge 1$ and $0 < q \le 2$ (refer to the applicable conditions of Lemmas 1 and 2). In the case $0 < p < 1$, we additionally require $\mathbf{w}(t)^{\top} \mathbf{x}_i \ne 0$. Then, we directly differentiate the objective function $F$ with respect to $\mathbf{w}$ and obtain the gradient value at the point $\mathbf{w} = \mathbf{w}(t)$ given by

$$\nabla F\big|_{\mathbf{w} = \mathbf{w}(t)} = F(\mathbf{w}(t))\, \mathbf{d}(t), \quad (17)$$

where we use the fact that the derivative of the sign function at a nonzero point is zero. The theorem is thus established.
Therefore, the theorem suggests that the iterative algorithm converges to a local maximum. The global maximum of the objective function may be reached if different initializations are tried; in particular, the initialization can be set as the basis vector of the conventional CSP method. The basis vector corresponding to the maximal value of the objective function is then recorded.
3.3. Multiple Spatial Filters’ Extension
Suppose $\mathbf{w}_1, \ldots, \mathbf{w}_{r-1}$ are the first $r - 1$ spatial filters obtained. The $r$th spatial filter is selected such that

$$\mathbf{w}_r = (I - W_{r-1} W_{r-1}^{\top})\, \mathbf{z}, \quad (18)$$

where $I$ is the $c \times c$ identity matrix, $W_{r-1}$ is the matrix formulated as $W_{r-1} = [\mathbf{w}_1, \ldots, \mathbf{w}_{r-1}]$, and $\mathbf{z}$ is a $c$-dimensional coefficient vector. That is, $\mathbf{w}_r$ lies in the orthogonal complement of the space spanned by $\mathbf{w}_1, \ldots, \mathbf{w}_{r-1}$. Substituting (18) into the objective function $F$, we have that $\mathbf{z}$ can be calculated by the bound optimization technique with the deflated samples:

$$\tilde{X}_1 = (I - W_{r-1} W_{r-1}^{\top}) X_1, \quad \tilde{X}_2 = (I - W_{r-1} W_{r-1}^{\top}) X_2. \quad (19)$$
Then, the basis vector $\mathbf{w}_r$ is specified according to (18) using the $\mathbf{z}$ obtained. Consequently, the bases matrix is renewed by padding $\mathbf{w}_r$ as $W_r = [W_{r-1}, \mathbf{w}_r]$, where $\mathbf{w}_r$ is normalized to unit length. Due to the fact that

$$I - W_r W_r^{\top} = (I - W_{r-1} W_{r-1}^{\top}) - \mathbf{w}_r \mathbf{w}_r^{\top}, \quad (20)$$

the deflation procedure can be performed incrementally from the previous results.
The iterative algorithm for obtaining $m$ spatial filters maximizing $F(\mathbf{w})$ is formally summarized in Algorithm 1. The counterpart spatial filters maximizing the reversed ratio are obtained likewise.
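The deflation scheme described above can be sketched generically. Here `single_filter` is an assumed callback standing for any routine that returns one unit-norm spatial filter (e.g., the iterative procedure of Section 3.1):

```python
import numpy as np

def multiple_filters(X1, X2, n_filters, single_filter):
    """Greedy deflation sketch: after each filter, project the samples onto
    the orthogonal complement of the filters found so far, so successive
    filters are mutually orthogonal."""
    c = X1.shape[0]
    W = np.zeros((c, 0))
    for _ in range(n_filters):
        # Deflate: remove the subspace already spanned by the found filters.
        P = np.eye(c) - W @ W.T
        w = single_filter(P @ X1, P @ X2)
        w = P @ w                      # keep w in the complementary subspace
        w /= np.linalg.norm(w)
        W = np.hstack([W, w[:, None]])
    return W
```

Because each new filter is projected back through the complement projector before normalization, the returned columns form an orthonormal set.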

3.4. Feature Extraction
We apply the multiple spatial filters to the EEG samples so as to extract features. That is, the set of $m$ orthonormal spatial filters $W = [\mathbf{w}_1, \ldots, \mathbf{w}_m]$ maximizing the objective function and another set of $m$ orthonormal spatial filters $\hat{W} = [\hat{\mathbf{w}}_1, \ldots, \hat{\mathbf{w}}_m]$ maximizing the reversed objective function are used as projection bases. The contrast between the two classes is formulated with respect to the mixed Lp- and Lq-norms. Therefore, for one single EEG trial $E \in \mathbb{R}^{c \times n}$, the feature vector is formed as

$$\mathbf{f} = \left[\log \|\mathbf{w}_1^{\top} E\|_p^p, \ldots, \log \|\mathbf{w}_m^{\top} E\|_p^p,\; \log \|\hat{\mathbf{w}}_1^{\top} E\|_q^q, \ldots, \log \|\hat{\mathbf{w}}_m^{\top} E\|_q^q\right]^{\top}, \quad (21)$$

where $\mathbf{f}$ is a $2m$-dimensional feature vector.
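One plausible reading of the feature construction above is sketched below; the log transform and the absence of any further normalization are assumptions, following the convention of classical CSP features:

```python
import numpy as np

def extract_features(trial, W_max, W_min, p, q):
    """Sketch of feature extraction for one EEG trial (channels x samples):
    filters maximizing the ratio use the Lp-norm dispersion, filters
    maximizing the reversed ratio use the Lq-norm dispersion."""
    fp = np.log(np.sum(np.abs(W_max.T @ trial) ** p, axis=1))
    fq = np.log(np.sum(np.abs(W_min.T @ trial) ** q, axis=1))
    return np.concatenate([fp, fq])
```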
4. Experiments
We examine the performance of a family of CSP-based methods, i.e., CSP-Lp/q with varying $p$ and $q$ (which includes CSP, CSP-L1, and CSP-Lp as special cases), together with CSP-LBP and FDCSP. The experiments are performed on three datasets of BCI competitions: datasets III (a) and IV (a) of BCI competition III and dataset II (a) of BCI competition IV.
4.1. EEG Datasets
Three publicly available datasets of BCI competitions are employed to evaluate the performance of the proposed CSP-Lp/q approach in EEG classification. In the datasets, 17 subjects were cued to perform motor imageries of hand and tongue activities, and EEG signals were simultaneously recorded. In this paper, we are interested in binary classification (i.e., left versus right hand or right hand versus foot motor imageries).
4.1.1. Dataset III (a) of BCI Competition III
The EEG signals collected from three subjects (s1, s2, and s3) were included in this dataset [34]. Sixty electrodes were employed in total, and the sampling frequency was 250 Hz. The numbers of total EEG trials for s1, s2, and s3 were 180, 120, and 120, respectively. The total trials for each subject were equally divided into training and testing trials. That is, the numbers of training (and testing) trials were 90, 60, and 60 for s1, s2, and s3, respectively. The purpose was to classify the testing trials into left- and right-hand motor imageries.
4.1.2. Dataset IV (a) of BCI Competition III
This dataset [34] comprised five subjects, i.e., aa, al, av, aw, and ay, recorded with 118 electrodes. There were 280 EEG trials for each subject, and the numbers of training trials for the five subjects were, respectively, 168, 224, 84, 56, and 28. The remaining trials were used as testing trials. We were interested in classifying right-hand and foot motor imageries.
4.1.3. Dataset II (a) of BCI Competition IV
The EEG signals in this dataset were generated from nine subjects (A01E, …, A09E), who were cued to perform four different motor imageries, i.e., left hand, right hand, foot, and tongue [35]. Twenty-two electrodes were placed on the scalp to measure the EEG signals. For each subject, there were 288 trials in each session, with 72 trials per condition. In our experiment, one session of trials was used as training data, while the unlabelled trials in the other session were used as testing data. Consequently, the number of training (and testing) trials was 144 for each subject since we only performed classification of the left- and right-hand motor imageries.
4.2. Preprocessing and Settings of Experiments
A fifth-order Butterworth filter with a passband between 8 Hz and 35 Hz (i.e., covering the α and β bands) was applied to the EEG signals. For the second dataset, the EEG time series were windowed in the range of 0.5 s to 3.75 s relative to the visual stimulus, while for the other two datasets, the interval was set as 0.5 s to 2.5 s.
The initial projection bases of CSP-Lp/q were set as the corresponding solutions of the conventional CSP method. In the iteration procedure of CSP-Lp/q, the learning rate parameter $\eta$ varied from $10^{-6}$ to $9 \times 10^{-4}$. The convergence threshold $\delta$ was set as $10^{-5}$. As commonly suggested in [9], we adopted three pairs of projection bases (i.e., six components) to filter the EEG samples when carrying out CSP and CSP-Lp/q. As confirmed by the experiments, the classification accuracy would not increase if the number of components became larger. The six-dimensional feature vectors produced by CSP-Lp/q on the training trials were used as samples to train LDA. Consequently, by maximizing the objective function of the Fisher criterion, we obtained one optimal basis vector, onto which the feature vectors were projected. The resulting one-dimensional scalars were then fed into the nearest-neighbor classifier. That is, the labels of the testing trials were assigned as the labels of the nearest training trials.
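The LDA-plus-nearest-neighbor stage described above might look as follows; this is a numpy-only sketch of two-class Fisher LDA with illustrative function names:

```python
import numpy as np

def lda_nearest(train_feats, train_labels, test_feats):
    """Fisher LDA for two classes: w is proportional to Sw^{-1}(m1 - m0).
    Project features to one dimension and assign each test trial the label
    of the nearest projected training trial."""
    X0 = train_feats[train_labels == 0]
    X1 = train_feats[train_labels == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    Sw = np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)
    w = np.linalg.solve(Sw, m1 - m0)            # Fisher direction
    z_train = train_feats @ w
    z_test = test_feats @ w
    nearest = np.abs(z_test[:, None] - z_train[None, :]).argmin(axis=1)
    return train_labels[nearest]
```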
The parameters $p$ and $q$ were determined by twofold cross-validation. That is, for each combination of $p$ and $q$, the filters were obtained on the training subset (one fold of the total training set), and the classification accuracy was then evaluated on the validation set (the other fold). This procedure was repeated two times with a different validation fold each time. We used the average of the classification rates across the two folds as the classification accuracy and selected the combination of $p$ and $q$ that resulted in the maximum average accuracy. Then, we recomputed the filters using both folds with the optimal combination determined. With the filters obtained, the generalization capacity was estimated on the testing trials; the testing process was not involved in the training of the filters and the parameters.
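The cross-validated grid search over $(p, q)$ can be sketched generically. Here `train_eval` is an assumed callback that trains the filters on one fold and returns the validation accuracy on the other:

```python
import numpy as np
from itertools import product

def select_pq(folds, train_eval, grid=np.arange(0.1, 2.01, 0.1)):
    """Grid search over (p, q) with cross-validation: `folds` is a list of
    (fit_data, val_data) splits; the pair with the highest mean validation
    accuracy is returned."""
    best, best_acc = None, -np.inf
    for p, q in product(grid, grid):
        acc = np.mean([train_eval(p, q, fit, val) for fit, val in folds])
        if acc > best_acc:
            best, best_acc = (p, q), acc
    return best
```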
4.3. Outliers Simulation
To highlight the classification performance of the proposed technique in the presence of outliers with varying $p$ and $q$, artificial outliers were introduced into the training samples. Specifically, we generated random outliers from a $c$-dimensional Gaussian distribution $N(\mu + \sigma, \Sigma)$, where $\mu$ and $\sigma$ were the channel-wise mean and standard deviation of all the training samples, respectively, and $\Sigma$ was the covariance matrix of the training samples. These artificial outlier points were imposed on the original EEG training samples at randomly selected positions. For each incorporation, the experiments were repeated ten times, and the mean classification accuracy was computed.
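A sketch of this outlier-injection scheme is given below; the `scale` parameter and the choice to overwrite randomly selected columns are assumptions for illustration:

```python
import numpy as np

def add_outliers(X, n_outliers, scale=1.0, seed=0):
    """Draw outliers from a Gaussian whose mean is the channel-wise mean
    shifted by the channel-wise standard deviation, with the covariance of
    the training samples, then overwrite randomly chosen columns of X."""
    rng = np.random.default_rng(seed)
    mu = X.mean(axis=1) + X.std(axis=1)          # shifted channel-wise mean
    cov = np.cov(X)                              # covariance of the samples
    pts = rng.multivariate_normal(mu, scale * cov, size=n_outliers)
    idx = rng.choice(X.shape[1], size=n_outliers, replace=False)
    Xc = X.copy()
    Xc[:, idx] = pts.T
    return Xc
```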
4.4. Experimental Results
We investigated the classification accuracy of the proposed CSP-Lp/q approach with $p$ and $q$ varying in the range $(0, 2]$ on the three datasets. In particular, CSP-Lp/q becomes the conventional CSP method when taking $p = q = 2$ and turns out to be CSP-L1 in the case of $p = q = 1$. For visual perception, the classification accuracies of CSP-Lp/q with different values of $p$ and $q$ on all the 17 subjects of the three datasets are shown in Figure 1, where the step length of $p$ and $q$ was set as 0.1. To clearly see the difference between classification accuracies with different $p$ and $q$ values, the quantities on subject s1 are shown in Table 1 as an example. Most of the classification accuracies exceeded 70% on the three datasets, which demonstrates the effectiveness of the CSP-based spatial filtering techniques. From Figure 1, it was observed that, in most cases, the highest classification accuracy of CSP-Lp/q was obtained when $p$ and $q$ took different values. This demonstrates the necessity of using different norms in modelling different classes of EEG samples. It was also noted that, in most cases, the optimal values of $p$ and $q$ were less than 1, which may imply that, for EEG signals, small norms are preferred so as to accommodate accompanying outliers. For visual comparison, examples of spatial filters obtained by the different CSP methods are delineated in Figure 2, where FDCSP is not shown since it uses frequency-domain quantities as input data.
The classification accuracies of CSP-Lp/q, as well as CSP, CSP-LBP, FDCSP, and CSP-L1, on the 17 subjects are shown in Table 2, where CSP-Lp/q was compared with the benchmark method CSP and with three state-of-the-art methods: CSP-LBP [21], FDCSP [22], and CSP-L1 [23]. The classification results differed considerably across subjects, which further illustrates individual difference in the BCI domain. It is straightforward that the average classification accuracy of CSP-Lp/q was higher than that of CSP (i.e., $p = q = 2$) and CSP-L1 (i.e., $p = q = 1$) since CSP and CSP-L1 are two special cases of CSP-Lp/q. Also, CSP-Lp/q was superior to CSP-LBP and FDCSP. Moreover, the classification accuracies of 7 out of the 17 subjects reached more than 90% when using CSP-Lp/q.
The training times on subject s1 for the five methods (i.e., CSP-Lp/q, CSP, CSP-LBP, FDCSP, and CSP-L1) were 91 s, 13 s, 42 s, 78 s, and 43 s, respectively. Here, the experiments were run on a Lenovo ThinkPad with a 2.90 GHz CPU and 8.00 GB RAM. It was not surprising that the iterative algorithms consumed more training time; in particular, the CSP-Lp/q method needed additional time to tune the parameters $p$ and $q$.
Statistical analysis was implemented to investigate the significance of the performance differences between the methods. We adopted the nonparametric Friedman test [36] for this purpose since the classification accuracies were reported on the same subjects and were therefore correlated. In the test, the classification accuracies produced by each method were regarded as a group, and blocks were formed by the different subjects. According to the rank statistic of the test, the performance differences between the methods were significant. Further, the Friedman rank sum test [37] was conducted for the post hoc multiple comparison, which found that the proposed CSP-Lp/q method exceeded the other approaches.
The classification accuracies of the five methods (i.e., CSP, CSP-LBP, FDCSP, CSP-L1, and CSP-Lp/q) under increasing occurrence frequencies of outliers on the three datasets are presented in Figures 3–5. Due to individual difference, the classification accuracies varied across subjects when outliers were introduced. From these figures, it is seen that the classification accuracies of most methods declined as the frequency of outliers increased. Clearly, on almost all subjects, the conventional CSP method and CSP-LBP had the sharpest decrease; that is, these two methods were badly influenced by the outliers. Due to the introduction of frequency-domain information, FDCSP was better than CSP and CSP-LBP. However, CSP, CSP-LBP, and FDCSP are all essentially based on the L2-norm, which is sensitive to outliers. By contrast, the classification accuracy of CSP-L1 showed a relatively mild decrease, which demonstrates the effect of the L1-norm in suppressing outliers. CSP-L1 is based on the L1-norm, which is a special case of CSP-Lp/q (i.e., $p = q = 1$). Generally, among the five methods, CSP-Lp/q had relatively superior classification accuracies on the three datasets; its decrease was small, and the classification accuracies remained satisfactory on most subjects. Note that the optimal combination of $p$ and $q$ may vary with the occurrence frequency of outliers. Again, the advantage of selecting appropriate norms for robust modelling was illustrated.
5. Conclusion
In this paper, we proposed a new approach, called CSP-Lp/q, which uses mixed Lp- and Lq-norms to model the two classes of EEG signals individually. The objective function of the CSP-Lp/q approach was defined by formulating the sample dispersion of one class with the Lp-norm and that of the other class with the Lq-norm, thereby fully taking the class difference into account when defining the sample dispersion. Thanks to the mixed norms adopted, the CSP-Lp/q approach is a robust generalization of CSP, CSP-L1, and CSP-Lp. The classification performance of the proposed CSP-Lp/q technique was demonstrated by experiments on three datasets of BCI competitions.
One limitation of the proposed method is that the optimal combination of $p$ and $q$ in CSP-Lp/q may depend on the data at hand. This is an issue of model selection. A practical avenue is to use cross-validation to determine them; setting them analytically remains an interesting but challenging issue. Another limitation is that, compared with the traditional CSP method, the proposed method is implemented iteratively. The determination of the step length in the gradient-ascent procedure is not a trivial issue, since an inappropriate setting could trap the iteration in a poor local optimum. We are currently studying an elementwise update with a closed-form solution. Further, the proposed method could be extended to an online (incremental) version, which would be useful for online BCI systems. Finally, we point out that the proposed technique of adopting Lp- and Lq-norms is a general framework for neuroscience decoding. It could be applied to other related spatial-pattern methods, say, discriminative spatial patterns [38].
Data Availability
All data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was supported by the National Natural Science Foundation of China under Grant 61773114 and University Synergy Innovation Program of Anhui Province under Grant GXXT2020015.