As a multichannel spatial filtering technique, common spatial patterns (CSP) has been successfully applied in the electroencephalogram (EEG)-based brain-computer interface (BCI) community. However, CSP is sensitive to outliers because of the employment of the L2-norm in its formulation, and it is therefore beneficial to model CSP robustly. In this paper, we propose a robust framework, called CSP-Lp/q, by formulating the variances of the two EEG classes with the Lp- and Lq-norms separately. With the mixed Lp- and Lq-norms, CSP-Lp/q takes the class-wise difference into account in formulating the sample dispersion. We develop an iterative algorithm to optimize the objective function of CSP-Lp/q and theoretically show its monotonicity. The superiority of the proposed CSP-Lp/q technique is experimentally demonstrated on three real EEG datasets of BCI competitions.

1. Introduction

The new communication channel between brain and machine based on electroencephalogram (EEG) is widely studied in the brain-computer interface (BCI) community [1]. It has potentially broad applications in the domains of biomedical engineering, rehabilitation, education, and entertainment [2]. The central issue in BCI is to accurately decode brain activities, which involves extracting discriminative features from the brain signals [3].

In fact, EEG signals are multichannel time series with high temporal resolution. A large number of machine learning algorithms have been used to extract discriminative EEG features, where the samples are the EEG signals collected at different time points [4, 5]. For the brain activity of motor imagery (MI), there exists a phenomenon called event-related synchronization/desynchronization (ERS/ERD) [6]. That is, the brain energy in the corresponding region of the cerebral cortex increases in the α-band (8–13 Hz) and the β-band (13–30 Hz), while the rhythmic energy in the contralateral region significantly decreases [7, 8]. Based on the ERS/ERD principle, the technique of common spatial patterns (CSP) carries out spatial filtering so as to reflect the difference between two EEG classes [9–11]. By maximizing or minimizing the ratio of the spatially filtered variances of the two EEG classes, a large body of literature addresses the application or development of CSP, for example, local temporal CSP (LTCSP) [12], comprehensive CSP (cCSP) [13], and spatial-temporal filters [14]. Some methods have been developed based on sparse learning for EEG analysis, for example, the optimization of spatial patterns using multiview learning within CSP [15], prediction of antidepressant response based on an EEG signature [16], the sparse Bayesian extreme learning machine for EEG classification [17], and the sparse group representation model [18]. Other methods address the problem of channel selection, say, based on correlation [19] and the bispectrum [20]. Recently, a novel method based on CSP (called CSP-LBP), which uses logarithmic band power (LBP) instead of variance to form the objective function, was developed to extract features [21]. CSP-LBP redistributes the features so that they can be easily classified. The frequency domain CSP (FDCSP) employs the samples in the frequency domain as input signals of CSP [22].
The transformed data reduce the signal complexity, and some statistical features such as mean, median, maximum, and skewness are extracted. The classification results by the artificial neural network verify the effectiveness of FDCSP.

In the case of clean data, the classical CSP-based methods exhibit satisfactory performance in discriminative feature extraction. However, EEG data are usually polluted by the electromyogram (EMG), the electrooculogram (EOG), and spikes. It is, therefore, important to resort to a robust avenue. We previously developed the L1-norm-based CSP (CSP-L1) [23] for the purpose of robust modelling, in which the sample dispersion is formulated with the L1-norm. The L1-norm has also been used for internal feature selection [24]. In contrast with the L2-norm, the L1-norm is robust from the perspective of statistical theory [25]. The L2-norm involves the square operation, which enlarges the negative effect of outliers associated with big deviations; in order to balance the big deviations, the resulting projection is likely to be distorted. The L1-norm, however, uses the absolute value operation, which could reduce the bad impact of outliers since the dominance of the big deviations is weakened. The L1-norm was then generalized to the Lp-norm scenario for CSP (CSP-Lp) [26] as well as in machine learning [27]. More importantly, the two classes of EEG signals are collected under two different mental conditions, and different mental activities may reflect different intrinsic properties of the EEG signals. Specifically, for the motor imageries of the left and right hands, the brain demonstrates a contralateral effect. The EEG signals recorded under the two mental conditions are associated with different cerebral hemispheres and may have different expressions of neurophysiological information. It may therefore be unnecessary to use the same norm to measure the band powers (i.e., dispersions of samples) of the two classes of EEG signals; instead, it is desirable to extract discriminative features by modelling each class of EEG signals specifically. If two different norms are used to measure the dispersions of the two classes of samples, there may still be room to improve the capacity of spatial filtering.

In this paper, we consider further generalizing the filtering performance of CSP-L1 by modelling each class of EEG signals individually. Specifically, we define the dispersions of the two classes of samples in the objective function by using the Lp- and Lq-norms, respectively. We term the proposed approach CSP-Lp/q. It is noted that, in the field of machine learning, mixed norms have recently shown great advantage in classification [28]. The CSP-Lp/q method aims to find the optimal projection vectors by optimizing the ratio of the Lp-norm dispersion of one class to the Lq-norm dispersion of the other class. The features are then extracted using the mixed Lp/q-norm dispersions. With the features, the classification is implemented via Fisher linear discriminant analysis (LDA) [29]. That is, the feature vectors produced on the training trials are regarded as input samples, by which the objective function of the Fisher criterion is evaluated. According to the principle of maximizing the ratio of between-class to within-class scatter, LDA seeks optimal projection vectors. Each testing trial is then assigned to the class of its nearest training trial in terms of the LDA-projected features. We point out that the proposed CSP-Lp/q approach has the following three advantages. (1) CSP-Lp/q incorporates robust modelling by defining the dispersions of the two classes of samples with different robustness-associated norms. (2) CSP-Lp/q is a generalization of CSP, CSP-L1, and CSP-Lp. Mathematically, taking $p=q=2$ or $p=q=1$ recovers the conventional CSP and CSP-L1 methods, respectively. Likewise, CSP-Lp is a special case of CSP-Lp/q obtained by setting $q=p$. (3) An iterative procedure is given to optimize the objective function of CSP-Lp/q.

We organize the rest of this paper as follows. We briefly review the classical CSP and CSP-L1 methods in Section 2. In Section 3, the CSP-Lp/q method, as well as its iterative procedure, is presented. The multiple spatial filters’ extension and feature extraction are also outlined in this section. Section 4 illustrates the experimental results on three real EEG datasets. Then, the paper is concluded in Section 5.

2. Conventional CSP and Its L1-Norm-Based Version

The basic idea of the classical CSP is to contrast the difference of two EEG classes by using the spatial filtering technique. Assume that $X_i^{(1)}\in\mathbb{R}^{c\times t}$ ($i=1,\ldots,n_1$) and $X_j^{(2)}\in\mathbb{R}^{c\times t}$ ($j=1,\ldots,n_2$) are the EEG trials of the two classes, where $c$ represents the number of sensors adopted in the experiment, $t$ is the number of samples recorded within each epoch, and $n_1$ and $n_2$ are the numbers of trials of the two classes, respectively. We assume that the EEG signals have already been band-pass filtered, centered, and scaled.

Denote the whole trials of the two classes by $X_1=[X_1^{(1)},\ldots,X_{n_1}^{(1)}]$ and $X_2=[X_1^{(2)},\ldots,X_{n_2}^{(2)}]$, and relabel the columns (i.e., normalized samples) of $X_1$ and $X_2$ as $x_i$ ($i=1,\ldots,N_1$) and $y_j$ ($j=1,\ldots,N_2$), respectively. Note that $N_1=n_1t$ and $N_2=n_2t$. Then, the covariance matrices of the two classes are simply formulated as $C_1=X_1X_1^{\top}/N_1$ and $C_2=X_2X_2^{\top}/N_2$, where $\top$ is the transpose operator.

Mathematically, the problem of seeking the spatial filter $w\in\mathbb{R}^{c}$ with CSP boils down to the optimization of the objective function given by
$$J_{2}(w)=\frac{\left\|w^{\top}X_{1}\right\|_{2}^{2}}{\left\|w^{\top}X_{2}\right\|_{2}^{2}}=\frac{w^{\top}C_{1}w}{w^{\top}C_{2}w},\tag{1}$$
where $\|\cdot\|_{2}$ denotes the L2-norm. It turns out to be a generalized eigenvalue decomposition problem: $C_{1}w=\lambda C_{2}w$. The few leading eigenvectors are associated with the largest or smallest eigenvalues, onto which the projected variances could produce discriminative features.
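As an illustration, the classical CSP solution can be sketched in a few lines of Python. This is a minimal sketch with NumPy; the function name `csp_filters` and the toy usage are our own, not from the paper.

```python
import numpy as np

def csp_filters(X1, X2, n_pairs=3):
    """Classical CSP filters for two classes of concatenated trials.

    X1, X2: (channels, samples) arrays.  Solves the generalized eigenproblem
    C1 w = lambda * C2 w and returns the eigenvectors (columns) with the
    n_pairs smallest and n_pairs largest eigenvalues.
    """
    C1 = X1 @ X1.T / X1.shape[1]
    C2 = X2 @ X2.T / X2.shape[1]
    # Reduce to an ordinary eigenproblem on C2^{-1} C1 (assumes C2 is
    # well conditioned, which holds after band-pass filtering in practice).
    evals, evecs = np.linalg.eig(np.linalg.solve(C2, C1))
    order = np.argsort(evals.real)
    idx = np.concatenate([order[:n_pairs], order[-n_pairs:]])
    return evecs[:, idx].real
```

The filters associated with the largest eigenvalues maximize the class-1-to-class-2 variance ratio, and those with the smallest eigenvalues minimize it.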

While CSP defines its objective function based on the L2-norm, the CSP-L1 approach, by contrast, utilizes the L1-norm to define the objective function given by
$$J_{1}(w)=\frac{\left\|w^{\top}X_{1}\right\|_{1}}{\left\|w^{\top}X_{2}\right\|_{1}}=\frac{\sum_{i=1}^{N_{1}}\left|w^{\top}x_{i}\right|}{\sum_{j=1}^{N_{2}}\left|w^{\top}y_{j}\right|},\tag{2}$$
where $\|\cdot\|_{1}$ denotes the L1-norm, which is the sum of the absolute values of the entries. Due to the nondifferentiability of the absolute value function at the zero point, the optimization of CSP-L1 is usually implemented by an iterative algorithm.

3. CSP with Mixed Lp- and Lq-Norms Optimization

In the presence of outliers, CSP-L1 exhibits advantage due to the robust modelling property of the L1-norm. Further, it is a good practice to employ the Lp-norm for robust modelling since the L1-norm may not be the optimal norm selection for robust modelling, and its generalized Lp-norm is more flexible for accommodating varying EEG signals [26]. Due to the fact that EEG signals are collected under different mental conditions and may embody different intrinsic expressions of neurophysiological information, it is beneficial to model each class of EEG signals specifically. Based on this basic neurophysiological principle, we propose using the Lp-norm to measure one class and the Lq-norm to measure the other class in defining the objective function as follows:
$$J_{p/q}(w)=\frac{\left\|w^{\top}X_{1}\right\|_{p}^{p}}{\left\|w^{\top}X_{2}\right\|_{q}^{q}}=\frac{\sum_{i=1}^{N_{1}}\left|w^{\top}x_{i}\right|^{p}}{\sum_{j=1}^{N_{2}}\left|w^{\top}y_{j}\right|^{q}},\tag{3}$$
where $\|\cdot\|_{p}$ and $\|\cdot\|_{q}$ denote the Lp-norm and the Lq-norm, respectively. The Lp-norm of a vector $v$ is defined as $\|v\|_{p}=(\sum_{i}|v_{i}|^{p})^{1/p}$. We refer to (3) as mixed Lp- and Lq-norms-based CSP and denote it by CSP-Lp/q. Clearly, in the case of $p=q=1$, CSP-Lp/q turns out to be CSP-L1. So, CSP-L1 is a special case of CSP-Lp/q. Since the optimization (both maximization and minimization) of CSP-Lp/q is not straightforward due to the mixed norms, in the following, we design an iterative algorithm under the framework of bound optimization to calculate the filter bases.

3.1. Algorithmic Implementation

The maximization of the objective function of CSP-Lp/q is presented as follows.

(1) Filter initialization: let $k$ be the iterative index and set $k=0$, and take any $c$-dimensional vector with unit length as the initial spatial filter $w^{(0)}$.

(2) Polarity definition: define two polarity functions $s_i^{(k)}$ and $r_j^{(k)}$ as
$$s_i^{(k)}=\operatorname{sgn}\left(w^{(k)\top}x_i\right),\quad i=1,\ldots,N_1,\tag{4}$$
$$r_j^{(k)}=\operatorname{sgn}\left(w^{(k)\top}y_j\right),\quad j=1,\ldots,N_2,\tag{5}$$
where $\operatorname{sgn}(\cdot)$ is the sign function.

(3) Gradient computation: compute the gradient vector as
$$d^{(k)}=\frac{p\sum_{i=1}^{N_1}\left|w^{(k)\top}x_i\right|^{p-1}s_i^{(k)}x_i}{\sum_{j=1}^{N_2}\left|w^{(k)\top}y_j\right|^{q}}-\frac{q\left(\sum_{i=1}^{N_1}\left|w^{(k)\top}x_i\right|^{p}\right)\sum_{j=1}^{N_2}\left|w^{(k)\top}y_j\right|^{q-1}r_j^{(k)}y_j}{\left(\sum_{j=1}^{N_2}\left|w^{(k)\top}y_j\right|^{q}\right)^{2}}.\tag{6}$$
Here, in case of $w^{(k)\top}x_i=0$ or $w^{(k)\top}y_j=0$, we then add a small turbulence to $w^{(k)}$.

(4) Filter update: update $w^{(k)}$ by
$$w^{(k+1)}=\frac{w^{(k)}+\eta d^{(k)}}{\left\|w^{(k)}+\eta d^{(k)}\right\|_{2}},\tag{7}$$
where $\eta$ is a learning rate parameter which takes a small positive value. Since only the direction of the basis vector is of interest, $w^{(k+1)}$ is normalized to unit length.

(5) Convergence inspection: if $w^{(k+1)}$ does not substantially augment the value of the objective function, i.e., $J_{p/q}(w^{(k+1)})-J_{p/q}(w^{(k)})$ is less than a threshold $\varepsilon$, then break the run and set $w^{\ast}=w^{(k+1)}$. Otherwise, set $k=k+1$, and return to Step 2.

(6) Filter determination: output the spatial filter $w^{\ast}$.
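The iterative procedure above can be sketched in Python. This is a sketch under our own naming; the default learning rate, threshold, and random initialization are illustrative choices, and the gradient follows the ratio objective $J(w)=\sum_i|w^{\top}x_i|^{p}/\sum_j|w^{\top}y_j|^{q}$.

```python
import numpy as np

def csp_lpq_filter(X1, X2, p=1.0, q=1.0, eta=1e-4, eps=1e-5,
                   max_iter=500, seed=0):
    """One CSP-Lp/q spatial filter by gradient ascent on the ratio
    J(w) = sum_i |w.x_i|^p / sum_j |w.y_j|^q.
    X1, X2: (channels, samples) arrays of the two classes."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(X1.shape[0])
    w /= np.linalg.norm(w)

    def J(wv):
        return np.sum(np.abs(wv @ X1) ** p) / np.sum(np.abs(wv @ X2) ** q)

    for _ in range(max_iter):
        u, v = w @ X1, w @ X2
        u[u == 0] += 1e-8              # small perturbation at zero projections
        v[v == 0] += 1e-8
        F = np.sum(np.abs(u) ** p)     # numerator dispersion
        G = np.sum(np.abs(v) ** q)     # denominator dispersion
        gF = X1 @ (p * np.sign(u) * np.abs(u) ** (p - 1))
        gG = X2 @ (q * np.sign(v) * np.abs(v) ** (q - 1))
        grad = gF / G - F * gG / G ** 2          # gradient of the ratio
        w_new = w + eta * grad
        w_new /= np.linalg.norm(w_new)           # keep unit length
        if J(w_new) - J(w) < eps:                # no substantial improvement
            break
        w = w_new
    return w
```

In this sketch an update is accepted only if it improves the objective by at least `eps`, so the returned filter never scores below the initialization, matching the monotonic behavior discussed in Section 3.2.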

It is seen that the algorithmic complexity is $O(c(N_1+N_2))$ per iteration, dominated by the gradient computation. In practice, the total number of iterations and the specific EEG signals determine the final computational cost.

3.2. Algorithmic Justification

The algorithmic procedure provided above is justified by the following theorem.

Theorem 1. With the iterative procedure for CSP-Lp/q, the objective function $J_{p/q}(w)$ is monotonically increased at each iteration, i.e., $J_{p/q}(w^{(k+1)})\geq J_{p/q}(w^{(k)})$.

We establish the theorem from the perspective of bound optimization [30, 31]. Bound optimization is a very powerful and general optimization framework. To use it, we need to coin a surrogate function $g(w\,|\,w^{(k)})$ that minorizes the objective, that is, $g(w\,|\,w^{(k)})\leq J_{p/q}(w)$ for all $w$, where the gap $J_{p/q}(w)-g(w\,|\,w^{(k)})$ is smallest (namely, zero) at the point $w=w^{(k)}$. Then, for the surrogate function, the next basis $w^{(k+1)}$ is updated such that
$$g\left(w^{(k+1)}\,\middle|\,w^{(k)}\right)\geq g\left(w^{(k)}\,\middle|\,w^{(k)}\right).\tag{8}$$

Therefore, the bound optimization is an iterative procedure whose convergence can be theoretically confirmed. Specifically, the monotonicity of the objective function is presented below [30]:
$$J_{p/q}\left(w^{(k+1)}\right)\geq g\left(w^{(k+1)}\,\middle|\,w^{(k)}\right)\geq g\left(w^{(k)}\,\middle|\,w^{(k)}\right)=J_{p/q}\left(w^{(k)}\right).\tag{9}$$

To achieve a surrogate function for maximizing $J_{p/q}(w)$, the following two lemmas are needed.

Lemma 1 (see [32]). Letting $p\geq 1$, for two vectors $u$ and $v$, one has the inequality
$$\|u\|_{p}^{p}\geq\|v\|_{p}^{p}+p\left(\operatorname{sgn}(v)\circ|v|^{p-1}\right)^{\top}(u-v),\tag{10}$$
and the equality holds when taking $u=v$, where the operations $\operatorname{sgn}(\cdot)$ and $|\cdot|^{p-1}$ are applied to the elements of a vector and $\circ$ performs the element-wise product between two vectors.

Lemma 2 (see [33]). Letting $0<q\leq 2$, for vector $u$ and vector $v$ with nonzero elements, one has the variational inequality
$$\|u\|_{q}^{q}\leq\frac{q}{2}u^{\top}\Lambda u+\left(1-\frac{q}{2}\right)\|v\|_{q}^{q}.\tag{11}$$

If $u=v$, the equality in the above expression is established, where $\Lambda$ is the diagonal matrix with $|v_{j}|^{q-2}$ being its diagonal elements.

We coin a surrogate function as
$$g\left(w\,\middle|\,w^{(k)}\right)=\frac{p\,a^{(k)\top}w+c_{1}}{\dfrac{q}{2}w^{\top}D^{(k)}w+c_{2}},\tag{12}$$
where $a^{(k)}=\sum_{i=1}^{N_1}|w^{(k)\top}x_i|^{p-1}s_i^{(k)}x_i$, $D^{(k)}=\sum_{j=1}^{N_2}|w^{(k)\top}y_j|^{q-2}y_jy_j^{\top}$, $c_{1}=\|w^{(k)\top}X_{1}\|_{p}^{p}-p\,a^{(k)\top}w^{(k)}$, and $c_{2}=(1-q/2)\|w^{(k)\top}X_{2}\|_{q}^{q}$. The function $g(w\,|\,w^{(k)})$ can serve as a valid surrogate function due to the fact that $g(w\,|\,w^{(k)})\leq J_{p/q}(w)$ and the equality holds when taking $w=w^{(k)}$. The fact is demonstrated as follows.

We compare the numerator and the denominator of $g(w\,|\,w^{(k)})$ with the ones of $J_{p/q}(w)$, respectively. For simplicity, we drop out the two constants $c_{1}$ and $c_{2}$ in $g(w\,|\,w^{(k)})$ since they do not affect the computation of the basis vector. Firstly, we calculate the numerator. By letting $u=X_{1}^{\top}w$ and $v=X_{1}^{\top}w^{(k)}$ and substituting them into the inequality of Lemma 1, we have that
$$\left\|w^{\top}X_{1}\right\|_{p}^{p}\geq p\,a^{(k)\top}w+c_{1}.\tag{13}$$

Clearly, when taking $w=w^{(k)}$, i.e., $u=v$, the equality in (13) appears. Secondly, we go on to consider the denominator. By letting $u=X_{2}^{\top}w$ and $v=X_{2}^{\top}w^{(k)}$ and substituting them into the inequality of Lemma 2, it follows that
$$\left\|w^{\top}X_{2}\right\|_{q}^{q}\leq\frac{q}{2}w^{\top}D^{(k)}w+c_{2}.\tag{14}$$

Again, the equality in (14) arises when taking $w=w^{(k)}$, i.e., $u=v$. Combining (13) and (14), we have that
$$g\left(w\,\middle|\,w^{(k)}\right)=\frac{p\,a^{(k)\top}w+c_{1}}{\dfrac{q}{2}w^{\top}D^{(k)}w+c_{2}}\leq\frac{\left\|w^{\top}X_{1}\right\|_{p}^{p}}{\left\|w^{\top}X_{2}\right\|_{q}^{q}}=J_{p/q}(w),\tag{15}$$
and $g(w\,|\,w^{(k)})=J_{p/q}(w)$ when taking $w=w^{(k)}$. This shows that the function given in (12) is a valid surrogate function.
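The two inequalities behind the surrogate can be spot-checked numerically. The following sketch draws random vectors and verifies the tangent-type lower bound (assumed here for $p\geq 1$) and the quadratic upper bound (assumed for $0<q\leq 2$); the specific values of $p$, $q$, and the vector length are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.standard_normal(50)
v = rng.standard_normal(50)   # nonzero elements almost surely

# Lemma-1-type bound (p >= 1): tangent-plane lower bound from the
# convexity of t -> |t|^p.
p = 1.5
lhs1 = np.sum(np.abs(u) ** p)
rhs1 = (np.sum(np.abs(v) ** p)
        + p * np.dot(np.sign(v) * np.abs(v) ** (p - 1), u - v))
assert lhs1 >= rhs1 - 1e-12

# Lemma-2-type bound (0 < q <= 2): quadratic upper bound with the
# diagonal weights |v_j|^{q-2}, tight at u = v.
q = 0.8
lhs2 = np.sum(np.abs(u) ** q)
rhs2 = ((q / 2) * np.sum(np.abs(v) ** (q - 2) * u ** 2)
        + (1 - q / 2) * np.sum(np.abs(v) ** q))
assert lhs2 <= rhs2 + 1e-12
```

Both assertions pass for any random draw, consistent with the bounds holding for all vectors rather than just in expectation.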

As guaranteed by the principle of the bound optimization, to maximize $J_{p/q}(w)$, it suffices to iteratively augment the surrogate function at each iteration $k$. Differentiating $g(w\,|\,w^{(k)})$ with respect to $w$, we obtain the gradient value at the point $w^{(k)}$ given by
$$\nabla g\left(w^{(k)}\,\middle|\,w^{(k)}\right)=\frac{p\,a^{(k)}}{\left\|w^{(k)\top}X_{2}\right\|_{q}^{q}}-\frac{q\left\|w^{(k)\top}X_{1}\right\|_{p}^{p}D^{(k)}w^{(k)}}{\left(\left\|w^{(k)\top}X_{2}\right\|_{q}^{q}\right)^{2}},\tag{16}$$
which is exactly $d^{(k)}$ given in (6). That is, $d^{(k)}$ is the vector of the gradient-ascending direction of $g(w\,|\,w^{(k)})$ at the point $w^{(k)}$. Consequently, by calculating the basis according to (7), we have $g(w^{(k+1)}\,|\,w^{(k)})\geq g(w^{(k)}\,|\,w^{(k)})$, which implies that update step (7) satisfies condition (8).

Note that, in the above derivation, we in fact require that $p\geq 1$ and $0<q\leq 2$ (refer to the applicable conditions of Lemmas 1 and 2). In the case $0<p<1$, we would additionally require $w^{\top}x_{i}\neq 0$ ($i=1,\ldots,N_{1}$). Then, we directly differentiate the objective function $J_{p/q}(w)$ with respect to $w$ and obtain the gradient value at the point $w^{(k)}$ given by
$$\nabla J_{p/q}\left(w^{(k)}\right)=\frac{p\sum_{i=1}^{N_1}\left|w^{(k)\top}x_i\right|^{p-1}s_i^{(k)}x_i}{\left\|w^{(k)\top}X_{2}\right\|_{q}^{q}}-\frac{q\left\|w^{(k)\top}X_{1}\right\|_{p}^{p}\sum_{j=1}^{N_2}\left|w^{(k)\top}y_j\right|^{q-1}r_j^{(k)}y_j}{\left(\left\|w^{(k)\top}X_{2}\right\|_{q}^{q}\right)^{2}},\tag{17}$$
where we use the fact that the derivative of the sign function at a nonzero point is zero. The theorem is thus established.

Therefore, the theorem suggests that the iterative algorithm converges to a local maximum. The global maximum of the objective function may be reached if different initializations are tried. In particular, the initialization can be set as the basis vector of the conventional CSP method. The basis vector corresponding to the maximal value of the objective function is then recorded as the optimal one.

3.3. Multiple Spatial Filters’ Extension

Suppose $w_{1},\ldots,w_{m-1}$ are the first $m-1$ spatial filters obtained. The $m$th spatial filter is selected such that
$$w_{m}=\left(I-W_{m-1}W_{m-1}^{\top}\right)u,\tag{18}$$
where $I$ is a $c$-dimensional identity matrix, $W_{m-1}$ is a matrix formulated as $W_{m-1}=[w_{1},\ldots,w_{m-1}]$, and $u$ is a $c$-dimensional coefficient vector. That is, $w_{m}$ lies in the orthogonal complement of the space spanned by $w_{1},\ldots,w_{m-1}$. Substituting (18) into the objective function $J_{p/q}(w)$, we have that $u$ can likewise be calculated by the bound optimization technique with the deflated samples:
$$\tilde{x}_{i}=\left(I-W_{m-1}W_{m-1}^{\top}\right)x_{i},\quad\tilde{y}_{j}=\left(I-W_{m-1}W_{m-1}^{\top}\right)y_{j}.\tag{19}$$

Then, the basis vector $w_{m}$ is specified according to (18) using the $u$ obtained. Consequently, the bases matrix is renewed by padding $w_{m}$ as $W_{m}=[W_{m-1},w_{m}]$, where $w_{m}$ is normalized such that its length is one. Due to the fact that
$$I-W_{m}W_{m}^{\top}=\left(I-W_{m-1}W_{m-1}^{\top}\right)-w_{m}w_{m}^{\top},\tag{20}$$
the deflation procedure can be performed based on the previous results.

The iterative algorithm for obtaining $K$ spatial filters maximizing $J_{p/q}(w)$ is formally summarized in Algorithm 1. The counterpart spatial filters maximizing the reversed ratio (i.e., with the roles of the two classes exchanged) are obtained likewise.

Input: Training data $X_{1}$ and $X_{2}$, norm quantities $p$ and $q$, and number of spatial filters $K$.
Output: Spatial filters $w_{1},\ldots,w_{K}$.
(1) Let $m=1$, and set $P_{0}$ as the $c$-dimensional identity matrix.
(2) Compute the coefficient vector $u$ on the deflated samples $P_{m-1}x_{i}$ ($i=1,\ldots,N_{1}$) and $P_{m-1}y_{j}$ ($j=1,\ldots,N_{2}$) according to the iterative procedure in Section 3.1.
(3) Compute the spatial filter as $w_{m}=P_{m-1}u/\|P_{m-1}u\|_{2}$.
(4) Deflate the projector as $P_{m}=P_{m-1}-w_{m}w_{m}^{\top}$.
(5) If $m<K$, then let $m=m+1$ and return to Step 2. Otherwise, terminate the run and output the spatial filters $w_{1},\ldots,w_{K}$.
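The deflation scheme of Algorithm 1 can be sketched generically in Python. All helper names are our own; `single_filter` stands in for the iterative procedure of Section 3.1.

```python
import numpy as np

def multiple_filters(X1, X2, K, single_filter):
    """Deflation loop: `single_filter(X1, X2)` returns one coefficient
    vector for the (deflated) data; successive filters are made mutually
    orthogonal by projecting onto the orthogonal complement of the
    filters found so far."""
    c = X1.shape[0]
    P = np.eye(c)                      # projector onto the complement space
    W = []
    for _ in range(K):
        u = single_filter(P @ X1, P @ X2)
        w = P @ u                      # force w into the complement space
        w /= np.linalg.norm(w)
        W.append(w)
        P = P - np.outer(w, w)         # deflate: remove the new direction
    return np.column_stack(W)
```

Because each filter is drawn from the orthogonal complement of its predecessors, the returned filters are orthonormal by construction.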
3.4. Feature Extraction

We apply the multiple spatial filters to the EEG samples so as to extract features. That is, the set of orthonormal spatial filters $u_{1},\ldots,u_{K}$ maximizing the objective function $J_{p/q}(w)$ and another set of orthonormal spatial filters $v_{1},\ldots,v_{K}$ maximizing the reversed objective function are used as projection bases. The contrast of the difference between the two classes is formulated with respect to the mixed Lp- and Lq-norms. Therefore, for one single EEG trial $Z\in\mathbb{R}^{c\times t}$ with columns $z_{1},\ldots,z_{t}$, the feature vector is formed as
$$f=\left[\sum_{l=1}^{t}\left|u_{1}^{\top}z_{l}\right|^{p},\ldots,\sum_{l=1}^{t}\left|u_{K}^{\top}z_{l}\right|^{p},\sum_{l=1}^{t}\left|v_{1}^{\top}z_{l}\right|^{q},\ldots,\sum_{l=1}^{t}\left|v_{K}^{\top}z_{l}\right|^{q}\right]^{\top},\tag{21}$$
where $f$ is a $2K$-dimensional feature vector.
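A minimal sketch of the feature computation follows. The function name is our own, and the log transform applied to the dispersions is a common normalization convention that we assume here, not one prescribed by the text.

```python
import numpy as np

def lpq_features(Z, W1, W2, p, q):
    """Feature vector for one trial Z (channels x samples): Lp dispersions
    under the filters W1 (maximizing class 1 over class 2) and Lq
    dispersions under W2 (the reversed ratio).  The log transform is an
    assumption on our part (a common band-power normalization)."""
    fp = np.log(np.sum(np.abs(W1.T @ Z) ** p, axis=1))
    fq = np.log(np.sum(np.abs(W2.T @ Z) ** q, axis=1))
    return np.concatenate([fp, fq])   # 2K-dimensional feature vector
```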

4. Experiments

We examine the performance of a family of CSP-based methods, i.e., CSP-Lp/q with varying $p$ and $q$ (which includes CSP, CSP-L1, and CSP-Lp as special cases), together with CSP-LBP and FDCSP. The experiments are performed on three datasets of BCI competitions: datasets III (a) and IV (a) of BCI competition III and dataset II (a) of BCI competition IV.

4.1. EEG Datasets

Three publicly available datasets of BCI competitions are employed to evaluate the performance of the proposed CSP-Lp/q approach in EEG classification. In the datasets, 17 subjects were cued to perform motor imageries of the hands, foot, and tongue, and EEG signals were simultaneously recorded. In this paper, we are interested in binary classification (i.e., left-hand versus right-hand or right-hand versus foot motor imageries).

4.1.1. Dataset III (a) of BCI Competition III

The EEG signals collected from three persons (s1, s2, and s3) were included in this dataset [34]. There were 60 electrodes employed in total, and the sampling frequency was 250 Hz. The numbers of total EEG trials for s1, s2, and s3 were 180, 120, and 120, respectively. The total trials for each subject were equally divided into training and testing trials. That is, the numbers of training (also testing) trials were 90, 60, and 60 for s1, s2, and s3, respectively. The purpose was to classify the testing trials into left- and right-hand motor imageries.

4.1.2. Dataset IV (a) of BCI Competition III

This dataset [34] was composed of five persons, i.e., aa, al, av, aw, and ay, with 118 electrodes. There were 280 EEG trials for each subject, and the numbers of training trials for the five subjects were, respectively, 168, 224, 84, 56, and 28. The remaining trials were used as testing trials. We were interested in classifying right-hand and foot motor imageries.

4.1.3. Dataset II (a) of BCI Competition IV

The EEG signals in this dataset were generated from nine persons (A01E, …, A09E), who were cued to perform four different motor imageries, i.e., left hand, right hand, foot, and tongue [35]. Twenty-two electrodes were placed on the scalp to measure the EEG signals. For each subject, there were 288 trials in each session, with 72 trials per condition. In our experiment, one session of trials was used as training data, while the unlabelled trials in the other session were used as testing data. Consequently, the number of training (also testing) trials was 144 for each subject since we only performed the classification of the left- and right-hand motor imageries.

4.2. Preprocessing and Settings of Experiments

A fifth-order Butterworth filter with the frequency band between 8 Hz and 35 Hz (i.e., covering the α- and β-bands) was applied to the EEG signals. For the second dataset, the EEG time series were windowed in the range of 0.5 s to 3.75 s relative to the visual stimulus, while for the other two datasets, the interval was set as 0.5 s to 2.5 s.
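The band-pass step can be sketched with SciPy as follows. The zero-phase `filtfilt` direction is our assumption; the text only specifies a fifth-order 8–35 Hz Butterworth band-pass.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def bandpass_8_35(eeg, fs=250.0, order=5):
    """Fifth-order Butterworth band-pass (8-35 Hz) applied along the last
    (time) axis of a (channels, samples) array.  Zero-phase filtering via
    filtfilt is our choice here; the text does not specify the direction."""
    nyq = fs / 2.0
    b, a = butter(order, [8.0 / nyq, 35.0 / nyq], btype="bandpass")
    return filtfilt(b, a, eeg, axis=-1)
```

A quick sanity check is that a 2 Hz component is strongly attenuated while a 20 Hz component passes nearly unchanged.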

The initial projection bases of CSP-Lp/q were set as the corresponding solutions of the conventional CSP method. In the iterative procedure of CSP-Lp/q, the learning rate parameter $\eta$ varied from $1\times10^{-6}$ to $9\times10^{-4}$. The convergence threshold $\varepsilon$ was set as $1\times10^{-5}$. As commonly suggested in [9], we adopted three pairs of projection bases (i.e., six components) to filter the EEG samples in carrying out CSP and CSP-Lp/q. As confirmed by the experiments, the classification accuracy would not increase if the number of components became larger. The six-dimensional feature vectors produced by CSP-Lp/q on the training trials were used as samples to train LDA. Consequently, by maximizing the objective function of the Fisher criterion, we obtained one optimal basis vector, onto which the feature vectors were projected. The resulting one-dimensional scalars were then fed into the nearest-neighbor classifier. That is, each testing trial was given the label of its nearest training trial.
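The LDA-plus-nearest-neighbor classification stage can be sketched as follows. This is a minimal binary Fisher LDA; the small ridge term added to the within-class scatter is our own numerical safeguard, not part of the described pipeline.

```python
import numpy as np

def lda_direction(F, y):
    """Binary Fisher LDA direction w = Sw^{-1}(m1 - m0) for feature matrix
    F (n x d) and labels y in {0, 1}; a tiny ridge keeps Sw invertible
    (our own safeguard)."""
    m0 = F[y == 0].mean(axis=0)
    m1 = F[y == 1].mean(axis=0)
    Sw = np.cov(F[y == 0], rowvar=False) + np.cov(F[y == 1], rowvar=False)
    return np.linalg.solve(Sw + 1e-8 * np.eye(F.shape[1]), m1 - m0)

def classify_nn(F_tr, y_tr, F_te, w):
    """Project onto w; each testing trial takes the label of the nearest
    training trial in the one-dimensional LDA space."""
    z_tr, z_te = F_tr @ w, F_te @ w
    return np.array([y_tr[np.argmin(np.abs(z_tr - z))] for z in z_te])
```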

The parameters $p$ and $q$ were determined by using the strategy of two-fold cross-validation. That is, for each combination of $p$ and $q$, the filters were obtained on the training set (a subset of the total training set), and then the classification accuracy was evaluated on the validation set (the rest of the total training set). This procedure was repeated two times, with a different validation set each time. We used the average of the classification rates across the two folds as the classification accuracy and selected the combination of $p$ and $q$ that resulted in the maximum average classification accuracy. Then, we recomputed the filters on all the two folds with the optimal combination determined. With the filters obtained, the generalization capacity was estimated on the testing trials; the testing process was not involved in the training of the filters or the parameters.
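The two-fold cross-validation over the $(p,q)$ grid can be sketched as follows. The callback `fit_score` stands in for training the CSP-Lp/q filters plus LDA on one fold and scoring the other; all names here are ours.

```python
import numpy as np

def select_pq(train_trials, train_labels, grid, fit_score):
    """Two-fold cross-validation over a (p, q) grid (a sketch).
    `fit_score(train, labels_tr, valid, labels_va, p, q)` is assumed to
    train spatial filters plus LDA on one fold and return the validation
    accuracy on the other."""
    idx = np.arange(len(train_trials))
    folds = [idx[: len(idx) // 2], idx[len(idx) // 2:]]
    best, best_acc = None, -np.inf
    for p in grid:
        for q in grid:
            accs = []
            for k in (0, 1):           # each fold serves once as validation
                tr, va = folds[1 - k], folds[k]
                accs.append(fit_score([train_trials[i] for i in tr],
                                      [train_labels[i] for i in tr],
                                      [train_trials[i] for i in va],
                                      [train_labels[i] for i in va],
                                      p, q))
            if np.mean(accs) > best_acc:
                best, best_acc = (p, q), float(np.mean(accs))
    return best, best_acc
```

After selection, the filters are refit on the full training set with the chosen $(p,q)$ before touching the testing trials.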

4.3. Outliers Simulation

To highlight the classification performance of the proposed technique in the presence of outliers with varying $p$ and $q$, artificial outliers were introduced into the training samples. Specifically, the randomly generated outliers were drawn from a $c$-dimensional Gaussian distribution constructed from the channel-wise mean $\mu$ and standard deviation $\sigma$ of all the training samples and the covariance matrix $\Sigma$ of the training samples. These artificial outlier points were imposed on the original EEG training samples at randomly selected positions. For each incorporation, the experiments were repeated ten times, and the mean classification accuracy was reported.
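A sketch of the outlier injection follows. Since the exact outlier distribution is not fully specified in the text, shifting the Gaussian mean by a multiple of the channel-wise standard deviation is our assumption, flagged as such in the code.

```python
import numpy as np

def inject_outliers(X, n_out, shift=3.0, rng=None):
    """Replace n_out randomly chosen columns (samples) of X
    (channels x samples) with draws from a Gaussian built from the training
    statistics.  The mean shifted by `shift` channel-wise standard
    deviations and the empirical covariance are assumptions on our part."""
    rng = np.random.default_rng(rng)
    mu = X.mean(axis=1)
    sd = X.std(axis=1)
    C = np.cov(X)                                    # channel covariance
    pos = rng.choice(X.shape[1], size=n_out, replace=False)
    out = rng.multivariate_normal(mu + shift * sd, C, size=n_out).T
    Xc = X.copy()
    Xc[:, pos] = out                                 # overwrite chosen samples
    return Xc
```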

4.4. Experimental Results

We investigated the classification accuracy of the proposed CSP-Lp/q approach with varying $p$ and $q$ on the three datasets. In particular, CSP-Lp/q becomes the conventional CSP method when taking $p=q=2$ and turns out to be CSP-L1 in the case of $p=q=1$. For visual inspection, the classification accuracies of CSP-Lp/q with different values of $p$ and $q$ on all the 17 subjects of the three datasets are shown in Figure 1, where the step length of $p$ and $q$ was set as 0.1. To clearly see the difference between classification accuracies with different $p$ and $q$ values, the quantities on subject s1 are shown in Table 1 as an example. It was seen that most of the classification accuracies exceeded 70% on the three datasets, which demonstrated the effectiveness of the CSP-based spatial filtering techniques. From Figure 1, it was observed that, in most cases, the highest classification accuracy of CSP-Lp/q was obtained when $p$ and $q$ took different values. This demonstrates the necessity of utilizing different norms in modelling the different classes of EEG samples. It was also noted that, in most cases, the optimal values of $p$ and $q$ were less than 1, which may imply that, for EEG signals, small norms are preferred so as to accommodate the accompanying outliers. For visual inspection, examples of spatial filters obtained by the different CSP methods are delineated in Figure 2, where FDCSP is not shown since it uses frequency quantities as input data.

The classification accuracies of CSP-Lp/q, as well as CSP, CSP-LBP, FDCSP, and CSP-L1, on the 17 subjects are shown in Table 2, where CSP-Lp/q was compared with the benchmark method CSP and with three state-of-the-art methods: CSP-LBP [21], FDCSP [22], and CSP-L1 [23]. For different subjects, the classification results obtained were quite different, which further illustrates individual differences in the BCI domain. It was straightforward that the average classification accuracy of CSP-Lp/q was higher than those of CSP (i.e., $p=q=2$) and CSP-L1 (i.e., $p=q=1$) since CSP and CSP-L1 are two special cases of CSP-Lp/q. Also, CSP-Lp/q was superior to CSP-LBP and FDCSP. Moreover, the classification accuracies of 7 out of the 17 subjects reached more than 90% when using CSP-Lp/q.

The training times on subject s1 consumed by the five methods (i.e., CSP-Lp/q, CSP, CSP-LBP, FDCSP, and CSP-L1) were 91 s, 13 s, 42 s, 78 s, and 43 s, respectively. Here, the experiments were run on the platform of a Lenovo ThinkPad with a 2.90 GHz CPU and 8.00 GB RAM. It was not surprising that the iterative algorithms consumed more training time. In particular, the CSP-Lp/q method needed additional time to train the parameters $p$ and $q$.

Statistical analysis was implemented to investigate the significance of the performance differences between the methods. We adopted the nonparametric Friedman test [36] for this purpose since the classification accuracies were reported on the same subjects and were therefore correlated. During the test, the classification accuracies produced by each method were regarded as a group, and blocks were formed with the different subjects. The rank statistic of the test indicated that the performance differences between the methods were significant. Further, the Friedman rank sum test [37] was conducted to perform the post hoc multiple comparison. It was found that the proposed CSP-Lp/q method exceeded the other approaches.

The classification accuracies of the five methods (i.e., CSP, CSP-LBP, FDCSP, CSP-L1, and CSP-Lp/q) in the case of increasing occurrence frequencies of outliers on the three datasets are presented in Figures 3–5. Due to individual differences, the classification accuracies varied on different subjects when outliers were introduced. From these figures, it was seen that the classification accuracies of most methods declined as the frequency of outliers increased. Clearly, on almost all subjects, the conventional CSP method and CSP-LBP had the sharpest decrease; that is, the two methods were badly influenced by the outliers. Due to the introduction of frequency-domain information, FDCSP performed better than CSP and CSP-LBP. However, CSP, CSP-LBP, and FDCSP are essentially based on the L2-norm, which is sensitive to outliers. By contrast, the classification accuracy of CSP-L1 had a relatively mild decrease, which demonstrated the effect of the L1-norm in suppressing the outliers. CSP-L1 is based on the L1-norm, which is a special case of CSP-Lp/q (i.e., $p=q=1$). Generally, among the five methods, CSP-Lp/q had relatively superior classification accuracies on the three datasets; its decrease was small, and the classification accuracies remained satisfactory on most subjects. Note that the optimal combination of $p$ and $q$ may vary with the occurrence frequency of outliers. Again, the advantage of selecting appropriate norms for robust modelling was illustrated.

5. Conclusion

In this paper, we proposed a new approach, called CSP-Lp/q, which uses the mixed Lp- and Lq-norms to model the two classes of EEG signals individually. The objective function of the CSP-Lp/q approach was defined by formulating the sample dispersion of one class with the Lp-norm and that of the other class with the Lq-norm. It thus takes the class difference fully into account when defining the sample dispersion. Thanks to the mixed norms adopted, the CSP-Lp/q approach is a robust generalization of CSP, CSP-L1, and CSP-Lp. The classification performance of the proposed CSP-Lp/q technique was demonstrated by the experiments on three datasets of BCI competitions.

One limitation of the proposed method is that the optimal combination of $p$ and $q$ in CSP-Lp/q may depend on the data at hand. This is an issue of model selection. A practical avenue is to use the strategy of cross-validation to determine them; setting them analytically remains an interesting but challenging issue. Another limitation is that, compared with the traditional CSP method, the proposed method is implemented in an iterative way. The determination of the step length in the gradient-ascent procedure is not a trivial issue since an inappropriate setting would make the iteration become trapped in a poor local optimum. We are currently studying an element-wise update with a closed-form solution. Further, the proposed method could be extended to an online (incremental) version, which would be useful for online BCI systems. Finally, we point out that the proposed technique of adopting the Lp- and Lq-norms is a general framework for neural decoding. It could be applied to other related spatial-pattern methods, say, discriminative spatial patterns [38].

Data Availability

All data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.


Acknowledgments

This work was supported by the National Natural Science Foundation of China under Grant 61773114 and the University Synergy Innovation Program of Anhui Province under Grant GXXT-2020-015.