Abstract

Emotion recognition based on multichannel electroencephalogram (EEG) signals is a key research area in the field of affective computing. Traditional methods extract EEG features from each channel based on extensive domain knowledge and ignore the spatial characteristics and global synchronization information across all channels. This paper proposes a global feature extraction method that encapsulates the multichannel EEG signals into gray images. The maximal information coefficient (MIC) for all channels was first measured. Subsequently, an MIC matrix was constructed according to the electrode arrangement rules and represented by an MIC gray image. Finally, a deep learning model designed with two principal component analysis convolutional layers and a nonlinear transformation operation extracted the spatial characteristics and global interchannel synchronization features from the constructed feature images, which were then input to support vector machines to perform the emotion recognition tasks. Experiments were conducted on the benchmark dataset for emotion analysis using EEG, physiological, and video signals. The experimental results demonstrated that the global synchronization features and spatial characteristics are beneficial for recognizing emotions and the proposed deep learning model effectively mines and utilizes the two salient features.

1. Introduction

As an advanced function of the human brain, emotion plays an important role in daily human life. Emotional expressions enable easier communication, and emotional states can affect a person’s work and learning. Therefore, emotion recognition offers a high application value and broad prospects in the fields of medicine, education, intelligent systems, human-computer interactions, and commerce and has become a research area of great interest [1].

Transitions between emotion states are accompanied by complex neural processes and physiological changes. In addition to facial expressions [2], speech [3–5], and body movement [6, 7], electrophysiological signals and endocrine-related indicators can also reflect changes in emotion states [8–10]. However, external characteristics such as facial expressions, speech, and body movement are easily affected by a person's subjective will as well as the external environment. In contrast, emotion recognition through the analysis of physiological electrical signals is relatively objective. Therefore, investigating the relationship between physiological signals, such as the electroencephalogram (EEG), and emotion states has garnered considerable attention [11–13].

A critical step in EEG-based emotion recognition is to extract features related to human emotional states from multichannel EEG signals. Various time-domain, frequency-domain, and time-frequency-domain features have been proposed in previous studies. The time-domain features include statistical features (such as power, mean, and standard deviation) [14–16], event-related potentials (ERPs) [17], Hjorth features (e.g., mobility, activity, and complexity) [18], the nonstationary index [19], and higher-order crossing features [20, 21]. Frequency-domain features, such as power spectral density (PSD), power, and energy, have often been utilized in existing studies [22–24]. Pan et al. [25] used emotion-related common spatial patterns and differential entropy features of five frequency bands in their research. To acquire the time-varying characteristics reflected in EEG frequency content, the short-time Fourier transform has been used to analyze EEG signals [26, 27]. Because EEG signals are time-varying, researchers have proposed new methods that obtain additional information by combining time- and frequency-domain features. The Hilbert–Huang transform and the discrete wavelet transform (DWT) analyze EEG signals in both the time and frequency domains. Hadjidimitriou and Hadjileontiadis [28] used EEG features extracted with the Hilbert–Huang spectrum method to recognize emotions. Yohanes et al. [29] proposed the use of DWT coefficients as features for emotion identification from EEG signals. These results show that EEG time-frequency features can provide salient information related to emotional states.

Many machine learning methods have been applied to emotion recognition tasks. Frantzidis et al. [17] obtained the amplitudes and latencies of ERP components and event-related oscillation amplitude features and then employed a support vector machine (SVM) as the classifier. Murugappan et al. [15] derived linear and nonlinear statistical features from five frequency bands (delta, theta, alpha, beta, and gamma) using the DWT and fed them to k-nearest neighbor (k-NN) classifiers. Chun et al. [30] applied spectral power features and a Bayes classifier with a weighted-log-posterior function for emotion recognition. Various deep learning methods have also been applied to EEG-based emotion recognition. Wang and Shang [31] presented an emotion recognition method based on deep belief networks (DBNs). Kwon et al. [32] used fusion features extracted from EEG signals and galvanic skin response signals together with convolutional neural networks (CNNs) for emotion recognition. In our previous work, an emotion recognition method based on an improved deep belief network with glia chains (DBN-GC) and multiple-domain EEG features was proposed [23]. Suwicha et al. [33] utilized the fast Fourier transform to calculate power spectral density features from EEG signals and proposed a deep learning network based on a stacked autoencoder. Li et al. [34] used CNNs and a recurrent neural network to construct a hybrid deep learning model.

Compared with traditional machine learning methods, deep learning models have shown excellent performance and strong potential for multichannel EEG-based emotion recognition tasks [31–36]. However, two challenges remain. First, most studies only consider the salient information related to emotional states in each channel's EEG signal in the time domain, frequency domain, and time-frequency domain. The spatial characteristics and global synchronization changes of all channels' EEG signals under different emotion states are neglected, even though they may provide salient information related to emotion states and thus benefit emotion recognition. Second, a simple and effective deep model is needed to mine and utilize these two sources of salient information for emotion recognition.

To address these challenges, a feature extraction method based on a global synchronization measurement from all EEG channels and a principal component analysis network- (PCANet-) based deep learning model is proposed. First, the maximal information coefficient (MIC) for each EEG channel pair is measured by a synchronization dynamics method. Subsequently, the MIC values of all channel pairs are arranged according to the given electrode order to construct the MIC features matrix, which is represented by a MIC gray image. Next, a novel deep learning model containing two PCA convolutional layers and a nonlinear transformation operation is designed for extracting the spatial characteristics and global synchronization features from the MIC gray image as the high-level abstract features. Finally, these high-level features are input to linear support vector machines to perform emotion recognition tasks. Synchronization analyses are extensively investigated in the neuroscience community, and synchronization measurements of EEG can characterize the underlying brain dynamics effectively [37–39]. Synchronization patterns of EEG signals change with changing emotion states. MIC is considered the best bivariate synchronization measurement method [40] that can find synchronization patterns related to emotion states in EEG. Compared with traditional deep learning methods for emotion recognition [31, 35, 36], the PCA network achieves similar or better emotion recognition performance with low computational complexity. Additionally, the PCA network effectively learns robust invariance features from the MIC gray images, and filter learning does not require regularized parameters and a numerical optimization solver [41].

The remainder of this paper is organized as follows. The emotion dataset and model, MIC-based feature extraction method, and PCANet-based emotion recognition method are introduced in Section 2. The experimental results are described in Section 3. Section 4 provides a discussion. Section 5 presents a brief conclusion of this work.

2. Materials and Methods

2.1. Dataset and Emotion Model

This research adopts the DEAP dataset [42] to evaluate the proposed emotion recognition methods. DEAP is a public dataset for emotion analysis that includes electroencephalogram, physiological, and video signals. DEAP recorded EEG and peripheral physiological signals of 32 subjects (16 females and 16 males, aged 19 to 37 years with an average of 26.9) while they watched 40 one-minute music videos as stimuli. These videos were chosen from 120 YouTube videos, and each was selected to activate a related emotion state. Each subject conducted 40 trials, so there are 1,280 trials (32 subjects × 40 trials) in the dataset. The EEG signals in each trial included 3 s of baseline signals and 60 s of stimulation signals. For each trial, 32 EEG channels (Fp1, AF3, F7, F3, FC5, FC1, T7, C3, CP5, CP1, P7, P3, PO3, Pz, O1, Oz, Fp2, AF4, Fz, FC2, Cz, C4, T8, CP6, CP2, F4, F8, FC6, P4, P8, O2, and PO4) were used. Each subject was asked to complete the self-assessment manikin (SAM) for valence, arousal, dominance, and liking after each trial, presented in Figure 1, where the scale of each of the four self-assessment dimensions ranges from 1 to 9. The preprocessed raw EEG data and the corresponding emotion self-assessments in the DEAP dataset are used in this work. The EEG signals (512 Hz) were processed to remove the effects of the electrooculogram (EOG) and then downsampled to 128 Hz. Band-pass filtering was implemented with cut-off frequencies of 4.0 and 45.0 Hz.

In this study, the arousal-valence scale is used for emotion analysis because it can measure emotions effectively and is widely used in similar research. As shown in Figure 2, a two-dimensional emotional model can be constructed. The arousal ranges from inactive (e.g., bored, uninterested) to active (e.g., excited, alert) and valence ranges from negative (e.g., stressed, sad) to positive (e.g., elated, happy).

We define two emotion classes for valence and arousal scales. For each trial in the dataset, if the associated self-assessment value of the arousal is greater than five, then the trial is assigned to the high arousal (HA) emotion class. Otherwise, the trial is assigned to the low arousal (LA) emotion class. Similarly, there are low valence (LV) and high valence (HV) emotion classes in the valence dimension.
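As a minimal illustration, this binarization of the SAM ratings into the four classes can be written as follows (a Python sketch; the ratings array and its layout are hypothetical).

import numpy as np

# Hypothetical array of shape (n_trials, 2) with the SAM valence and arousal scores (1-9).
ratings = np.array([[7.1, 3.0], [4.9, 8.2], [5.0, 5.5]])

valence, arousal = ratings[:, 0], ratings[:, 1]
valence_label = (valence > 5).astype(int)   # 1 = high valence (HV), 0 = low valence (LV)
arousal_label = (arousal > 5).astype(int)   # 1 = high arousal (HA), 0 = low arousal (LA)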

2.2. MIC-Based Deep Feature Learning Framework for Emotion Recognition

Considering both recognition performance and analysis efficiency, this study analyzes the synchronization between EEG signals with a simple and effective deep feature learning method. The proposed method consists of feature extraction based on synchronization dynamics, followed by high-level feature learning and pattern classification with a linear SVM, as illustrated in Figure 3. The 32-channel EEG signals are segmented with the same window size of 12 seconds in the experiments. All MIC measurements of each channel pair within each time window are calculated and organized into an MIC gray image according to the electrode arrangement rules. These images are then processed and classified by the deep learning architecture based on a PCA network.

2.3. Synchronization Measurement Based on the Maximal Information Coefficient

The brain is a large network of neurons, and synchronous activities of neurons in different regions can provide useful information regarding the neural activity of interest. Relationships between brain regions can be described as synchronization between electrodes. Then, a connectivity matrix can be defined with elements representing synchronization information between two electrodes. The order of the electrodes is also essential because the spatial features of brain regions in the connectivity matrix can be retained. Therefore, this work uses new methods to explore the synchronization and spatial features of brain regions as they relate to emotion states.

2.3.1. Maximal Information Coefficient

The MIC is a synchronization measure that quantifies both linear and nonlinear synchronization relationships between two random variables (e.g., bivariate EEG signal segments) [40]. The concept originates from the maximal information-based nonparametric exploration statistics.

Specifically, according to [40], for a finite set S of ordered pairs, the x-values and y-values of S are partitioned into a bins and b bins (empty bins allowed), respectively, to form an a-by-b grid G. The maximal mutual information over all such grid partitions is given by equation (1):
$$I^{*}(S, a, b) = \max_{G} I(S|_{G}), \tag{1}$$
where $I(S|_{G})$ represents the mutual information [43] of the distribution induced by S on the partitioned grid G. The characteristic matrix of S can be expressed using equation (2) as
$$M(S)_{a,b} = \frac{I^{*}(S, a, b)}{\log \min\{a, b\}}. \tag{2}$$

The value of MIC is obtained using equation (3):
$$\mathrm{MIC}(S) = \max_{a b < B(n)} M(S)_{a,b}, \tag{3}$$
where n is the number of samples and the number of partitioned grid cells $ab$ is bounded by $B(n)$, with $B(n) = n^{0.6}$ in this paper. The value of each element in the characteristic matrix is between 0 and 1. If the order of the two variables is exchanged, the value of MIC remains unchanged.
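For illustration, the MIC of a single channel pair can be computed with the open-source minepy package, which implements the MINE statistics of [40]. This package is not part of the original work; the snippet below is a minimal sketch that assumes the grid bound $B(n) = n^{0.6}$ (the alpha parameter) and uses placeholder random signals.

import numpy as np
from minepy import MINE  # pip install minepy; not used in the original paper

# Two EEG channel segments of equal length (placeholder random data here).
rng = np.random.default_rng(0)
x = rng.standard_normal(1536)   # e.g., 12 s of one channel at 128 Hz
y = rng.standard_normal(1536)

mine = MINE(alpha=0.6, c=15)    # alpha = 0.6 corresponds to B(n) = n^0.6 in equation (3)
mine.compute_score(x, y)
print(mine.mic())               # MIC value in [0, 1]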

The activation level of neurons in brain regions is not synchronous under different emotion states. The MIC is a nonlinear bivariate synchronization measurement method, whereas EEG is nonstationary and provides nonlinear physiological signals. Theoretically, MIC can effectively measure the synchronization features between brain regions. Therefore, using MIC to measure the synchronization of physiological signals (EEG) reflected by different brain regions can obtain salient information related to corresponding emotion states.

2.3.2. MIC Gray Image

Studies have shown that the global information and spatial features of all EEG channels can improve the performance of emotion recognition [13, 24]. To characterize the spatial information and global synchronization information of all EEG channels, an MIC gray image is constructed as the feature for all EEG channels. The construction process of an MIC gray image is shown in Figure 4.

The MIC of each EEG channel pair is measured for each sample. Assuming a sample has c channels, each channel pair requires one MIC measurement, so there are $c(c-1)/2$ measurements in total. In this work, the value of c is set to 32, so 496 MIC features are obtained for each sample.

When constructing the feature matrix, the arrangement order of the EEG channels is determined by the arrangement rules. According to the arrangement rules of the electrodes (Fp1, AF3, F7, F3, FC1, FC5, T7, C3, CP1, CP5, P7, P3, Pz, PO3, O1, Oz, O2, PO4, P4, P8, CP6, CP2, C4, T8, FC6, FC2, F4, F8, AF4, Fp2, Fz, and Cz), the MIC values for all EEG channel pairs are combined to construct the MIC feature matrix (MICFM), represented as U, as seen in equation (4):
$$U = \begin{bmatrix} u_{1,1} & u_{1,2} & \cdots & u_{1,c} \\ u_{2,1} & u_{2,2} & \cdots & u_{2,c} \\ \vdots & \vdots & \ddots & \vdots \\ u_{c,1} & u_{c,2} & \cdots & u_{c,c} \end{bmatrix}, \tag{4}$$
where $u_{i,j}$ represents the MIC of the synchronization measurement between EEG channels i and j. The MICFM is a symmetric matrix, so $u_{i,j} = u_{j,i}$ when $i \neq j$. The MICFM is constructed from each sample, as shown in part (b) of Figure 4, and each element in the matrix represents the MIC of the corresponding EEG channel pair. For example, the red elements in the matrix of Figure 4(b) represent the MIC of channels AF3 and Fp1.

To enhance the ability of the feature representation and extract high-level features easily, the feature matrix is represented by a MIC gray image. As shown in Figure 4(c), the value of each pixel is the MIC of the corresponding EEG channel pair.
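A minimal sketch of this construction is given below. Here `eeg` is a hypothetical (channels × samples) array whose rows are already ordered according to the electrode arrangement rules, `mic` stands for any pairwise MIC routine (e.g., the minepy call above), and the diagonal value of 1 and the 8-bit gray-level scaling are our assumptions for illustration.

import numpy as np

def mic_gray_image(eeg, mic):
    """eeg: (c, T) array of channel-reordered EEG; mic: callable returning the MIC of two 1-D signals."""
    c = eeg.shape[0]
    U = np.eye(c)                                    # MICFM; diagonal set to 1 (self-synchronization)
    for i in range(c):
        for j in range(i + 1, c):
            U[i, j] = U[j, i] = mic(eeg[i], eeg[j])  # symmetric, as in equation (4)
    return (U * 255).astype(np.uint8)                # map MIC values in [0, 1] to gray levels

# Example: image = mic_gray_image(eeg_segment, lambda a, b: pair_mic(a, b))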

In brain regions, the EEG signals of two physically adjacent electrodes tend to be similar due to the volume conduction effect. Therefore, the electrode arrangement rules retain the similarity information within the same brain region as well as the difference information between different brain regions. The MIC gray image shows the synchronization variations of human EEG signals on the scalp directly and accurately. Moreover, compared with traditional features, the MIC gray image contains the spatial characteristics and global synchronization features of multichannel EEG signals, which is beneficial for identifying emotion states. Given these advantages, the MIC gray image of each sample is extracted for further analysis.

2.4. PCA Network for Deep Feature Learning and Emotion Recognition

To utilize the spatial characteristics and global synchronization features, a deep learning model is introduced to extract high-level features from the MIC gray images for emotion recognition. The proposed deep model is based on a PCA network [41], PCANet, which consists of a hierarchical feature learning stage and a linear SVM classifier. The structure of the model is shown in Figure 5. First, the convolution filters in the feature learning stage are learned from the input MIC gray images through PCA. The local patterns and the patterns of neighboring values in the MIC gray images are extracted by the PCA convolution filters. In CNNs, convolution filters are initialized randomly and directly determine the learned features. In contrast, the convolution filters of this PCANet-based deep learning model are generated by PCA, which allows more discriminative features to be learned with a simple architecture. These discriminative features can effectively represent the different synchronization patterns of EEG signals in various emotion states. Similar to CNNs, the feature learning stage of the PCANet-based deep learning model can comprise multiple PCA filter layers; this work uses two PCA convolution layers.

Second, a nonlinear processing layer that includes binarization and hash mapping processing enhances the separation of the discriminative features. Then, block-wise histograms are used to reduce the dimension of features. Finally, the SVM classifier outputs the emotion recognition results based on the learned high-level features from the input images.

Suppose there are N input MIC gray images $\{I_i\}_{i=1}^{N}$ of size $m \times n$, and the size of the patch in all convolution layers is $k_1 \times k_2$. Only the PCA convolution filters need to be learned from the input MIC gray images $\{I_i\}_{i=1}^{N}$. The components of the model structure are described in detail in the following sections.

2.4.1. First PCA Convolution Layer

For each MIC gray image $I_i$ in the training set, a $k_1 \times k_2$ patch is taken around each pixel. All patches of the i-th image are collected as $x_{i,1}, x_{i,2}, \ldots, x_{i,mn} \in \mathbb{R}^{k_1 k_2}$, where $x_{i,j}$ represents the j-th vectorized patch in the i-th image $I_i$. Then, $\bar{X}_i = [\bar{x}_{i,1}, \bar{x}_{i,2}, \ldots, \bar{x}_{i,mn}]$ is obtained by subtracting the patch mean from each patch, where $\bar{x}_{i,j}$ denotes a mean-removed patch. The same matrix is constructed for every MIC gray image in the training set, and all of them are put together. Thus, equation (5) is obtained as
$$X = [\bar{X}_1, \bar{X}_2, \ldots, \bar{X}_N] \in \mathbb{R}^{k_1 k_2 \times N m n}. \tag{5}$$

Assuming that the number of filters in the i-th layer is $L_i$, the PCA minimizes the reconstruction error within a set of orthonormal filters, as expressed in equation (6):
$$\min_{V \in \mathbb{R}^{k_1 k_2 \times L_1}} \| X - V V^{T} X \|_F^2, \quad \text{s.t.} \ V^{T} V = I_{L_1}, \tag{6}$$
where $I_{L_1}$ is an identity matrix and the solution is the $L_1$ principal eigenvectors of $X X^{T}$. Thus, the PCA filters can be defined as
$$W_l^{1} = \mathrm{mat}_{k_1, k_2}\big(q_l(X X^{T})\big) \in \mathbb{R}^{k_1 \times k_2}, \quad l = 1, 2, \ldots, L_1, \tag{7}$$
where the l-th principal eigenvector of $X X^{T}$ is represented as $q_l(X X^{T})$, which is mapped to $W_l^{1}$ by the function $\mathrm{mat}_{k_1, k_2}(\cdot)$. Here, $\mathrm{mat}_{k_1, k_2}(v)$ is a function that maps $v \in \mathbb{R}^{k_1 k_2}$ to a matrix $W \in \mathbb{R}^{k_1 \times k_2}$. The variation of all the mean-removed training patches is captured by the dominating principal eigenvectors. The output of the first convolutional layer is expressed as
$$I_i^{l} = I_i \ast W_l^{1}, \quad l = 1, 2, \ldots, L_1, \tag{8}$$
where $\ast$ denotes two-dimensional convolution, $I_i^{l}$ is the output of the i-th input image, and $W_l^{1}$ is the l-th PCA filter of the first convolutional layer.
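A compact NumPy/SciPy sketch of this first stage (patch extraction, patch-mean removal, PCA filter learning, and filtering, following equations (5)–(8)) might look as follows; the function and variable names are ours, and zero-padded 'same' convolution is assumed so that the filter outputs keep the input size.

import numpy as np
from scipy.signal import convolve2d

def pca_filters(images, k1, k2, L):
    """Learn L PCA filters of size k1 x k2 from a list of 2-D images (equations (5)-(7))."""
    patches = []
    for img in images:
        for r in range(img.shape[0] - k1 + 1):            # collect all k1 x k2 patches
            for c in range(img.shape[1] - k2 + 1):
                p = img[r:r + k1, c:c + k2].ravel()
                patches.append(p - p.mean())               # remove the patch mean
    X = np.stack(patches, axis=1)                          # the matrix X of equation (5)
    eigvals, eigvecs = np.linalg.eigh(X @ X.T)             # eigen-decomposition of X X^T
    top = eigvecs[:, np.argsort(eigvals)[::-1][:L]]        # L principal eigenvectors
    return [top[:, l].reshape(k1, k2) for l in range(L)]   # mat_{k1,k2}(q_l(X X^T))

def pca_conv_layer(images, filters):
    """Equation (8): convolve every image with every PCA filter (zero-padded 'same' output)."""
    return [[convolve2d(img, W, mode='same') for W in filters] for img in images]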

2.4.2. Second PCA Convolution Layer

The operations of the second convolutional layer are similar to those of the first, where all the patches of $I_i^{l}$ are collected. After subtracting the patch mean from each patch and collecting the mean-removed patches for all the filter outputs, the following equation is obtained:
$$Y = [Y^{1}, Y^{2}, \ldots, Y^{L_1}] \in \mathbb{R}^{k_1 k_2 \times L_1 N m n}, \tag{9}$$
where $Y^{l}$ represents the matrix collected from the l-th filter output of the first layer after the patch mean is removed from each patch. The PCA filters of the higher layer are denoted as
$$W_\ell^{2} = \mathrm{mat}_{k_1, k_2}\big(q_\ell(Y Y^{T})\big) \in \mathbb{R}^{k_1 \times k_2}, \quad \ell = 1, 2, \ldots, L_2. \tag{10}$$

For each input $I_i^{l}$ of the second (h-th) layer, convolving with $W_\ell^{2}$ for $\ell = 1, 2, \ldots, L_2$ produces an output set $O_i^{l}$ containing $L_2$ images, such that
$$O_i^{l} = \big\{ I_i^{l} \ast W_\ell^{2} \big\}_{\ell=1}^{L_2}.$$

For the original sample $I_i$, additional PCA convolution layers can be built by repeating this process. Here, the $D = L_1$ inputs $\{I_i^{l}\}_{l=1}^{L_1}$ produce D sets of outputs $\{O_i^{l}\}_{l=1}^{D}$, and each set contains $L_2$ images. Therefore, the number of output images of the two-layer network is $L_1 L_2$ for each input image.

2.4.3. Nonlinear Processing Layer

Following the previous stages, the output feature maps of the input MIC gray images are obtained and binarized in a nonlinear processing layer. The outputs of the h-th layer form D ($D = L_1$) sets, and each set has $L_2$ outputs $\{I_i^{l} \ast W_\ell^{2}\}_{\ell=1}^{L_2}$. The outputs are binarized using a Heaviside step function $H(\cdot)$, whose value is one for positive entries and zero otherwise. Around each pixel, the vector of $L_2$ binary bits is viewed as a decimal number, which maps the $L_2$ outputs back into a single integer-valued "image," as given in equation (11):
$$T_i^{l} = \sum_{\ell=1}^{L_2} 2^{\ell-1} H\big(I_i^{l} \ast W_\ell^{2}\big), \tag{11}$$
where every pixel value is an integer ranging from 0 to $2^{L_2}-1$.

To reduce the dimension of the features, a block-based histogram is applied next. Each "image" $T_i^{l}$ is partitioned into B blocks. Then, the histogram of the decimal values in each block is computed, and all B histograms are concatenated into a single vector, represented as $\mathrm{Bhist}(T_i^{l})$. Thus, the "high-level features" of the input MIC gray image $I_i$ are defined to be the set of block-wise histograms, such that
$$f_i = \big[\mathrm{Bhist}(T_i^{1}), \ldots, \mathrm{Bhist}(T_i^{L_1})\big]^{T} \in \mathbb{R}^{(2^{L_2}) L_1 B}. \tag{12}$$
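The binary hashing of equation (11) and the block-wise histograms of equation (12) can be sketched as follows. The helper names are ours, `second_layer_maps` is assumed to be the list of $L_2$ second-layer outputs produced from one first-layer map, and non-overlapping blocks are used for brevity (the block overlap ratio is discussed in Section 3).

import numpy as np

def hash_maps(second_layer_maps):
    """Equation (11): binarize the L2 maps and pack them into one integer-valued image."""
    T = np.zeros(second_layer_maps[0].shape, dtype=np.int64)
    for l, O in enumerate(second_layer_maps):
        T += (O > 0).astype(np.int64) << l     # Heaviside step weighted by 2^l
    return T                                   # pixel values lie in [0, 2**L2 - 1]

def block_histograms(T, block, L2):
    """Equation (12): histogram of the decimal codes in each (non-overlapping) block."""
    hists = []
    rows, cols = T.shape
    for r in range(0, rows - block + 1, block):
        for c in range(0, cols - block + 1, block):
            h, _ = np.histogram(T[r:r + block, c:c + block],
                                bins=2 ** L2, range=(0, 2 ** L2))
            hists.append(h)
    return np.concatenate(hists)               # one block-wise histogram vector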

This block-wise histogram encodes spatial information and offers some degree of translation invariance in the obtained features within each block. The block size and the overlapping ratio of local blocks are important parameters of the PCANet-based deep learning model and are discussed in the next section.

After the high-level EEG features of the MIC gray images are learned through the process described above, an SVM with a linear kernel is introduced to process the extracted high-level features and perform the emotion classification tasks.
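Putting the pieces together, a minimal end-to-end sketch of the PCANet feature extractor followed by a linear SVM, reusing the hypothetical helpers sketched above (pca_filters, pca_conv_layer, hash_maps, block_histograms), could read as follows; the commented call shows the settings found effective in Section 3.2.

import numpy as np
from sklearn.svm import LinearSVC

def pcanet_features(images, k1, k2, L1, L2, block):
    """Two PCA stages followed by hashing and block histograms; returns one vector per image."""
    W1 = pca_filters(images, k1, k1, L1)                                 # first-stage filters
    stage1 = pca_conv_layer(images, W1)                                  # L1 maps per image
    W2 = pca_filters([m for maps in stage1 for m in maps], k2, k2, L2)   # second-stage filters
    feats = []
    for maps in stage1:                                                  # per input image
        parts = []
        for m in maps:                                                   # per first-stage map
            outs = pca_conv_layer([m], W2)[0]                            # L2 second-stage maps
            parts.append(block_histograms(hash_maps(outs), block, L2))
        feats.append(np.concatenate(parts))
    return np.stack(feats)

# Example (values from Section 3.2, arousal dimension):
# X = pcanet_features(mic_images, k1=7, k2=5, L1=11, L2=15, block=8)
# clf = LinearSVC().fit(X, labels)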

3. Results

When a deep learning model is employed for emotion recognition, adequate data is essential to achieve meaningful performance. In this work, we augment the available training dataset through a temporal segmentation method. A 3-second pretrial baseline is removed in the first stage. Then, a sliding window divides the raw EEG signal of each channel into several segments.

We found that when the duration of the sliding window is too short, the MIC features of samples with the same label differ considerably, which leads to poor recognition performance. On the other hand, when the duration of the sliding window is too long, the number of samples is insufficient. Therefore, the duration of each sliding window is set to 12 s, and the segments are nonoverlapping to avoid intratrial redundant information when training classifiers. Each segment covering the same period across all channels in one trial is treated as an independent sample, and the five new samples inherit the label of the original trial. The number of samples per subject thus increases to 200 (40 trials × 5 segments), and the number of samples over all subjects expands to 6,400 (200 samples × 32 subjects). All experiments are carried out in the two emotion dimensions of arousal and valence.
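The augmentation step can be summarized by the sketch below, which drops the 3 s pre-trial baseline and cuts each 60 s trial into five non-overlapping 12 s segments. The array shape follows the preprocessed DEAP layout (32 EEG channels × 8064 samples at 128 Hz); the function name is ours.

import numpy as np

def segment_trial(trial, fs=128, baseline_s=3, win_s=12):
    """trial: (32, 8064) array = 3 s baseline + 60 s stimulus at 128 Hz.
    Returns a list of (32, fs * win_s) segments that inherit the trial's label."""
    data = trial[:, baseline_s * fs:]                 # remove the 3 s pre-trial baseline
    win = win_s * fs
    n_seg = data.shape[1] // win                      # 60 s / 12 s = 5 segments
    return [data[:, k * win:(k + 1) * win] for k in range(n_seg)]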

3.1. Comparison between Global MIC Features and Common Features of EEG Signals

First, global MIC features are compared with traditional time-domain and frequency-domain features of EEG signals under commonly used classifiers. The time-domain features include mean, variance, standard deviation, the first difference, the second difference, and approximate entropy with 192 (6 features × 32 channels) values for each sample. The frequency-domain features include the average PSD values (5 power × 32 EEG channels) of all EEG channels in theta (4–8 Hz), slow-alpha (8–10 Hz), alpha (8–12 Hz), beta (12–30 Hz), and gamma (30–45 Hz) bands. In addition, the frequency-domain features also include the difference of average PSD (4 power differences × 14 channel pairs) in theta, alpha, beta, and gamma bands for 14 EEG channel pairs (Fp2-Fp1, AF4-AF3, F4-F3, F8-F7, FC6-FC5, FC2-FC1, C4-C3, T8-T7, CP6-CP5, CP2-CP1, P4-P3, P8-P7, PO4-PO3, and O2-O1) between the right and left scalps.

To obtain convincing results, the dimensions of these two types of EEG features should be similar. Previous studies have also shown that the recognition performance of fusion features is better than that of a single feature type [23, 36]. Therefore, we fuse the time-domain and frequency-domain features of all EEG channels to form fusion features. The dimensions of the fusion features and the global MIC features are 408 and 496, respectively. Because increasing the feature dimension may itself degrade recognition performance, the time-domain and frequency-domain features are also compared with the global MIC features individually.

Several machine learning methods are used to perform the emotion recognition tasks with these four types of features. The classifiers include a linear SVM (SVM-1), an RBF-kernel SVM (SVM-2), k-nearest neighbors (n_neighbors = 5), logistic regression (LR), and an artificial neural network (ANN) (activation = relu, alpha = 1e-05, batch_size = auto, beta_1 = 0.9, beta_2 = 0.999, learning_rate_init = 0.001, max_iter = 200). The ANN consists of two hidden layers, each with 100 nodes, and all classifiers are implemented with the Scikit-learn toolkit [44]. For each classifier, parameters not mentioned above use their default values. All experiments are conducted in the two emotion dimensions (arousal and valence) using 10-fold cross-validation, and samples from all subjects are used. To achieve the best recognition results, the fusion features, time-domain features, and frequency-domain features are normalized to the range [0, 1] before being input to the SVM-1, SVM-2, k-NN, and ANN classifiers and are not normalized for the LR classifier.
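The classifier setup described above corresponds roughly to the following scikit-learn sketch. The feature matrix and labels are placeholders, the MinMaxScaler stands in for the [0, 1] normalization applied to the non-MIC features (for the MIC features, which already lie in [0, 1], it is effectively a no-op), and parameters not listed keep their defaults as in the text.

import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import cross_val_score

X = np.random.rand(6400, 496)          # placeholder feature matrix (e.g., global MIC features)
y = np.random.randint(0, 2, 6400)      # placeholder HA/LA (or HV/LV) labels

classifiers = {
    "SVM-1": make_pipeline(MinMaxScaler(), SVC(kernel="linear")),
    "SVM-2": make_pipeline(MinMaxScaler(), SVC(kernel="rbf")),
    "k-NN": make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=5)),
    "LR": LogisticRegression(),        # LR features are not normalized
    "ANN": make_pipeline(MinMaxScaler(),
                         MLPClassifier(hidden_layer_sizes=(100, 100), activation="relu",
                                       alpha=1e-5, beta_1=0.9, beta_2=0.999,
                                       learning_rate_init=0.001, max_iter=200)),
}
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=10)       # 10-fold cross-validation
    print(name, scores.mean())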

The recognition results of the four types of EEG features under the different classifiers are shown in Figure 6 and suggest that the global MIC features achieve satisfactory emotion recognition performance. To further illustrate the effectiveness of the global MIC features in emotion recognition, the experimental results are compared in detail. First, in the arousal dimension, the average recognition accuracies of the global MIC features are 0.6797, 0.6769, 0.6806, 0.6383, and 0.6827 under the five classifiers (LR, k-NN, SVM-1, SVM-2, and ANN), respectively, as detailed in Figure 7. Compared with the fusion features, the recognition accuracies of the five classifiers with the global MIC features improve by 5.61%, 7.55%, 7.00%, 3.84%, and 5.94%, respectively. Compared with the frequency-domain features, the recognition accuracies of the five classifiers with the global MIC features improve by 7.31%, 8.65%, 10.50%, 2.69%, and 5.81%, respectively. Compared with the time-domain features, the recognition accuracies of the five classifiers with the global MIC features improve by 3.98%, 4.72%, 11.15%, 4.62%, and 2.48%, respectively. These results suggest that the recognition performance of the global MIC features in the arousal dimension is better than that of the other three types of features under all comparison classifiers.

The recognition performance of the global MIC features in the valence dimension has average recognition accuracies of 0.6638, 0.6617, 0.6598, 0.6283, and 0.6769 with the five classifiers LR, k-NN, SVM-1, SVM-2, and ANN, respectively, as shown in Figure 8. Compared with the fusion features, these recognition accuracies of the five classifiers using global MIC features improve by 4.44%, 10.49%, 10.39%, 8.85%, and 7.39%, respectively. Compared with the frequency-domain features, the recognition accuracies of the five classifiers using global MIC features improve by 5.99%, 13.17%, 10.25%, 9.90%, and 13.46%, respectively. Compared with the time-domain features, the recognition accuracies of the five classifiers using global MIC features improve by 2.31%, 5.64%, 11.77%, 8.42%, and 4.40%, respectively. These results suggest that the recognition performance of the global MIC features in the valence emotion dimension is better than the other types of features for all classifiers.

These results demonstrate that the global MIC features effectively characterize the differences of EEG signals in different emotion states and that the global synchronization information of EEG signals helps recognize emotion states. The MIC features based on the synchronization measurement offer new ideas for feature extraction in EEG-based emotion recognition.

3.2. Emotion Recognition Using High-Level Features Based on MIC Gray Images

Second, the high-level features extracted from the MIC gray images by the PCANet-based model are leveraged for the emotion recognition tasks. For each sample, a corresponding MIC gray image is constructed, providing a total of 6,400 images from all subjects. The 10-fold cross-validation technique is also used, and to verify the effectiveness of the proposed high-level features, different model parameters are set for the experiments. The parameters of the PCANet-based model include the number of filters in each layer ($L_1$ and $L_2$ denote the numbers of filters in the first and second layers, respectively), the filter size of each layer (the filter sizes in the first and second layers are denoted $k_1$ and $k_2$, respectively), the block overlap ratio (BOR), and the block size in the nonlinear processing layer.

3.2.1. Impact of the Number of Filters

Except for the number of filters, the other parameters initially remain unchanged: the block overlap ratio is set to 0.5, the filter size is kept the same in both layers, and the block size is 8 × 8. We alternately change the number of filters in one layer while keeping the number of filters in the other layer unchanged. For example, when we change the value of $L_1$, the value of $L_2$ remains the same.

The results in Figure 9 show that when the values of $L_1$ and $L_2$ lie within a certain range, their impact on the recognition performance in the two emotion dimensions changes significantly. Thus, the values of $L_1$ and $L_2$ are set to 1 through 21 and 7 through 15, respectively. As shown in Figure 9, for any value of $L_2$, the optimal interval of $L_1$ is 9 to 17 in the arousal dimension and 7 to 17 in the valence dimension. The recognition performance improves with increasing $L_2$. However, when the value of $L_2$ increases to 15, the recognition accuracy increases only negligibly. As the number of filters increases, computing resources and memory requirements increase dramatically. Therefore, it is appropriate to set the value of $L_2$ to 15.

The highest accuracy in the arousal dimension, 0.7130, is achieved with the combination $L_1 = 11$ and $L_2 = 15$. The highest accuracy in the valence dimension, 0.6958, is achieved with $L_1 = 9$ and $L_2 = 15$. In addition, these results show that as $L_1$ and $L_2$ increase, their impact on emotion recognition decreases gradually. Moreover, $L_2$ impacts emotion recognition more significantly than $L_1$ in both dimensions.

3.2.2. Impact of the Filter Size

The optimum combination of the numbers of filters is obtained from the previous section. With these parameters fixed, the impact of the filter size is investigated. Specifically, the filter size of the second layer ($k_2$) is set to 5, 7, 9, and 11, and the filter size of the first layer ($k_1$) increases from 3 to 15 with an interval of 2. The block overlap ratio is set to 0.5, and the block size is 8 × 8. The recognition results are shown in Figure 10.

The filter size is chosen with reference to the input image size, and we select appropriate ranges ($k_1$ from 3 to 15, $k_2$ from 5 to 11) in our experiment to show the influence of $k_1$ and $k_2$ on emotion recognition. As shown in Figure 10, with an increase in $k_1$, the recognition accuracy rises rapidly. However, as the value of $k_1$ continues to increase, the recognition accuracy decreases slowly. In addition, with the increase of $k_2$, the recognition performance of the model shows a downward trend. Whatever value $k_2$ takes, the optimal value of $k_1$ is 7 for the arousal dimension and 5 for the valence dimension.

On the other hand, as the filter size decreases, the computational cost increases significantly, but the recognition performance does not change much. The highest recognition accuracy in the arousal dimension, 0.7169, is achieved with the combination $k_1 = 7$ and $k_2 = 5$. The highest recognition accuracy in the valence dimension, 0.6958, is achieved with $k_1 = 5$ and $k_2 = 5$. These results also show that $k_1$ and $k_2$ have a certain impact on emotion recognition performance, with the impact of $k_1$ being more obvious.

3.2.3. Impact of the Block Overlap Ratio

After determining the optimum values of the filter size and number in the two emotion dimensions, the impact of the overlap ratio on emotion recognition is next considered. The block size is 8 × 8, and the block overlap ratio ranges from 0 to 0.9, so the impacts from all overlap ratios are investigated. The recognition results are shown in Table 1 for the different values of the block overlap ratio.

As shown in Table 1, with an increase in the block overlap rate, the recognition performance changes little, which may be due to two reasons. First, the regularity between pixels in the MIC gray images of each trial is not apparent. Second, the global MIC features of each participant are different. In the arousal dimension, the maximum recognition accuracy is only 1.77% higher than the minimum while the variance is only 0.2826. In the valence dimension, the maximum recognition accuracy is only 1.69% higher than the minimum while the variance is only 0.1656. These results suggest that the block overlap ratio offers no significant impact on the performance of recognition in the two dimensions when using MIC gray images. The best recognition accuracies of the two dimensions are achieved when the block overlap ratio is 0.2, with values of 0.7169 and 0.6968 for the arousal and valence dimensions, respectively. To reduce the dimension of the features without affecting the recognition accuracy, the overlap rate of the blocks is recommended to be set to 0.2 in the two dimensions.

3.2.4. Impact of the Block Size

Because the shape of the images is square, only square blocks are selected. The block size is set to 4 × 4, 5 × 5, 6 × 6, 7 × 7, 8 × 8, 9 × 9, 10 × 10, 11 × 11, 12 × 12, and 13 × 13, while the other parameters are set to the optimal combination obtained from the above experiments.

The results are shown in Table 2, where the block size is chosen with reference to the input image size. When the block size is smaller than 4 × 4 or larger than 13 × 13, the recognition performance changes little; therefore, the block sizes range from 4 × 4 through 13 × 13. Initially, as the size increases, the recognition accuracy gradually improves. As the size continues to increase, the recognition accuracy begins to decrease. Blocks reduce the dimension of the features and offer some degree of translation invariance in the obtained features. Because of the complexity of the EEG signals, there may be various deformations in a MIC gray image. As the block size increases, the robustness of the model to such deformations strengthens, which initially leads to an increase in recognition accuracy.

When the block size is too large, the number of features obtained by the model is small, so the recognition accuracy is unsatisfactory. In the arousal dimension, the maximum recognition accuracy is 0.7185 when the block size is 6 × 6. In addition, the maximum recognition accuracy is 1.80% higher than the minimum, while the variance is 0.1776. In the valence dimension, the maximum recognition accuracy is 0.7021 when the block size is 5 × 5. The maximum is 1.53% higher than the minimum, and the variance is 0.1370. These results also demonstrate that the block size offers a slight impact on emotion recognition accuracy. At this point, all parameters of the PCA network are analyzed, and the best identification results (arousal: 0.7185, valence: 0.7021) and parameter settings are obtained.

3.3. Comparison between Global MIC Features and High-Level Features

To illustrate the advantages of the high-level features extracted from MIC gray images over the global MIC features, the recognition performance of the two feature types is compared; the results are shown in Table 3.

First, we compare the recognition accuracies of the two types of features using a linear SVM with the same parameters. In the arousal dimension, the average recognition accuracy of the high-level features is 3.79% greater than the global MIC features. In the valence dimension, the average recognition accuracy of the high-level features is 4.23% greater than the global MIC features.

Second, a Wilcoxon signed-rank test is used to analyze the recognition performance of the high-level features and the global MIC features in both dimensions. The null hypothesis is that the recognition performance of the two feature types is similar, and it is accepted if the p-value is larger than the significance level. The p-values for the arousal and valence dimensions are 0.002 and 0.0039, respectively, so the null hypothesis is rejected, meaning that the recognition performance of the high-level features is superior to that of the global MIC features. These comparisons show that the high-level features improve the emotion recognition performance in both dimensions. The MIC gray images include the global synchronization features as well as the spatial characteristics, which contain salient information related to the emotion states, making the recognition performance of the high-level features better than that of the global MIC features.
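Such a comparison can be reproduced with scipy.stats.wilcoxon; the sketch below assumes that the per-fold accuracies of the two feature types are available as two equally long lists (the values shown are hypothetical, not the paper's results).

from scipy.stats import wilcoxon

# Hypothetical per-fold accuracies (10-fold CV) for the two feature types.
acc_high_level = [0.72, 0.71, 0.73, 0.70, 0.72, 0.71, 0.73, 0.72, 0.70, 0.71]
acc_global_mic = [0.68, 0.67, 0.69, 0.68, 0.67, 0.68, 0.69, 0.67, 0.68, 0.68]

stat, p = wilcoxon(acc_high_level, acc_global_mic)
print(p)   # reject the null hypothesis of similar performance if p is below the significance level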

3.4. Comparison between CNN and PCA Network-Based Deep Learning Model

The components used to construct the PCA network-based deep learning model are basic and computationally efficient. To demonstrate its efficiency and effectiveness, the PCA network-based deep learning model is compared with a traditional CNN. With the 6,400 MIC gray images from all subjects, 10-fold cross-validation is used. To obtain convincing results, the number of convolution layers, number of filters, and filter sizes of the CNN are kept similar to those of the PCA network. The CNN includes two convolution layers, the first employing a 5 × 5 kernel with a stride of 1, a ReLU activation function, and ten filters. The second convolutional layer uses the same parameters as the first except that it has 15 filters. The output layer is a softmax classifier, and the batch size and number of epochs are 120 and 500, respectively.
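The CNN baseline described above corresponds roughly to the following tf.keras sketch; the optimizer, loss, and flattening before the output layer are not specified in the paper and are our assumptions.

import tensorflow as tf

cnn = tf.keras.Sequential([
    tf.keras.layers.Conv2D(10, (5, 5), strides=1, activation="relu",
                           input_shape=(32, 32, 1)),          # MIC gray images
    tf.keras.layers.Conv2D(15, (5, 5), strides=1, activation="relu"),
    tf.keras.layers.Flatten(),                                # assumed before the softmax output
    tf.keras.layers.Dense(2, activation="softmax"),           # HA/LA or HV/LV
])
cnn.compile(optimizer="adam",                                 # assumed optimizer
            loss="sparse_categorical_crossentropy",
            metrics=["accuracy"])
# cnn.fit(train_images, train_labels, batch_size=120, epochs=500)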

The average recognition accuracies of the CNN in the arousal and valence dimensions are 0.6907 and 0.6853, respectively. The recognition performance of the PCA network in the two emotion dimensions is better than that of the CNN, with the average recognition accuracies of the PCA network (0.7185, 0.7021) being 4.03% and 2.45% higher than those of the CNN. In addition, the overall training time of the CNN is significantly longer than that of the PCA network. In our experiment, training the PCA network on 6,400 images of 32 × 32 pixels took approximately seven minutes, while the CNN took about an hour. According to Chan et al. [41], the overall computational complexity of the two-layer PCA network in the training and testing phases can be verified as
$$O\big(mnk_1k_2(L_1 + L_2) + mn(k_1k_2)^2 + (k_1k_2)^3\big). \tag{13}$$

From equation (13), the PCA network offers low computational complexity. Compared with the PCA network, the filters of CNN require a numerical optimization solver during the training phase, which significantly increases the computational complexity. Compared to the CNN, the PCA network-based deep learning model offers better emotion recognition performance with lower computational complexity. Therefore, it is suitable for emotion recognition using MIC gray images.

4. Discussion

This study investigates the feasibility of MIC- and PCA network-based deep learning approaches, which have recently been developed for big data relevance analysis and image classification. To account for multichannel interdependencies, data from all available channels are included in feature extraction by using the MIC and the deep learning algorithm, as opposed to applying traditional time-frequency analysis to each channel individually.

4.1. Synchronization Dynamics Related to Emotions Expressed in EEG

Our first observation is that the emotion classification performance for all classifiers and both emotion dimensions is higher with the MIC features than with the traditional time- and frequency-domain features. Many studies have shown that brain regions often respond differently to various emotions. When an individual is in an emotional state related to avoidance motivation, such as disgust or fear, the right frontal lobe is clearly more activated than the left frontal lobe. In emotional states related to approach motivation, such as pleasure, the left frontal lobe is more activated than the right [45]. Therefore, the synchronization of brain regions in the corresponding emotion state can be used to represent salient information about that emotion. The global MIC features might reveal different dynamic processes of perceptual arousal and excitation, reflected in the nonlinear relationships measured by MIC. Our experiments demonstrate that the global synchronization features measured by MIC are superior to features obtained with traditional time-frequency analysis, indicating that MIC can capture a variety of potentially interesting relationships between paired brain regions that traditional time-frequency analysis cannot.

Emotion states are represented by the physiological electrical signals of the cerebral cortex, which can be mapped onto a 2D space. Preserving the spatial characteristics of multichannel EEG signals can enhance the separation of EEG features in different emotional states. Therefore, in this work, the MIC gray image is constructed from the global MIC features according to the electrode arrangement rules. The MIC gray image represents features closer to the real response of the brain and may contain additional emotion-related information compared with traditional features. The experiments in this paper also suggest that the preserved spatial characteristics are beneficial to emotion recognition.

4.2. Advantages of Unsupervised Deep Neural Network

Second, the experimental results show that the high-level features based on the synchronization and spatial characteristics of multichannel EEG can improve emotion recognition performance. Neural networks are successfully used in many fields because of their high nonlinearity, self-adaptive weight adjustment, anti-interference ability, and self-adaptive feature selection. This research uses a PCA network, an unsupervised deep network model, to process the MIC gray images, and this unsupervised deep learning model effectively captures the synchronization and spatial characteristics of the MIC gray images.

By examining the influence of different network parameters on the recognition performance, we found that the PCA network's performance does not rely on the number of layers when processing MIC gray images. The number and size of the filters have a greater impact on the recognition performance, while the impact of the block size and BOR is not obvious. These observations suggest that the difference in synchronization dynamics between EEG channels is evident in different emotional states. On the other hand, compared with traditional CNNs, the deep model based on the PCA network obtains better recognition performance in the two emotion dimensions and has lower computational complexity. In the PCA network-based deep model, PCA is employed to learn multistage filter banks that automatically learn the emotion-related features of the MIC gray images, and this automatic feature extraction method is simple and effective. The proposed use of an unsupervised deep network to extract high-level emotion-related features from EEG contributes new ideas for follow-up research.

4.3. Advantages of Proposed Approach over Existing Methods

The results of our proposed method are also compared with those of other emotion recognition methods based on the same dataset. Table 4 presents the details of the comparison methods and their emotion recognition results. Among them, features extracted from the central nervous system (CNS) were used in reference [42], and a hierarchical bidirectional gated recurrent unit network (H-ATT-BGRU) was used in reference [51].

As shown in Table 4, our proposed model outperforms most of the compared methods, achieving the highest recognition accuracy except for reference [48] in the arousal dimension. The reason may be that the method in reference [48] uses a subject-dependent emotion classification model that only classifies samples belonging to the corresponding subject, and the average of all subjects' results is used as the final recognition result.

In contrast, this study uses samples from all subjects and builds a general EEG classification model intended to detect the emotional states of different subjects accurately. This study also shows that changes in the synchronization between EEG channels can be used to represent changes in a person's emotion. Moreover, conventional emotion classification approaches rely on time- and frequency-domain analyses of EEG, which require sufficient a priori knowledge, whereas our method requires none. Furthermore, the proposed approach does not require intensive noise removal from the EEG, unlike other available methods.

4.4. Potential for EEG Data Applications

Using the MIC, complex associations in EEG data, and the physiological and psychological information they represent, such as emotion and disease, can be further analyzed. This approach offers a new way to use EEG for related pattern classification and recognition tasks. In addition, a parallel MIC computing scheme can reduce the computational complexity of MIC [52] to enable real-time synchronization analysis of EEG data in practical applications. The filter learning in PCANet does not involve regularized parameters or a numerical optimization solver. Moreover, the construction of the PCANet includes only a cascaded linear map and a nonlinear output stage. Such simplicity offers an alternative and refreshing perspective to convolutional deep learning networks for processing EEG data. The overall work provides a general and cost-effective solution for the emotion classification of EEG and holds great potential for other EEG-related classification tasks, such as epileptic dementia and Alzheimer's disease detection.

5. Conclusions

A novel feature extraction method based on synchronization dynamics and deep learning was proposed for multichannel EEG-based emotion recognition, comprising two primary components. First, a synchronization dynamics method is used to extract the global MIC features from all channel pairs of the EEG signals, which are then represented by an MIC gray image according to the proposed feature construction method. The MIC gray image reflects the global synchronization information as well as the spatial characteristics of all EEG channels. Thus, the image contains the spatial and global synchronization features that provide salient information related to emotional states. Second, a PCA network-based deep learning model and a linear SVM classifier extract the high-level features and perform the emotion classification, respectively.

The experimental results suggest that the proposed feature extraction method achieves satisfactory results and show that the MIC features can automatically and effectively characterize salient information in EEG signals related to emotional states. In addition, this work demonstrates that the spatial and global synchronization features contained in the proposed MIC gray image are beneficial for recognizing human emotion. The deep learning model based on the PCA network can effectively mine and utilize these two types of salient information for emotion recognition.

Data Availability

The dataset used in this paper is derived from the Queen Mary University of London (http://www.eecs.qmul.ac.uk/mmv/datasets/deap/).

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

The authors would like to thank the support of the National Natural Science Foundation of China under grant no. 61502150, Fundamental Research Funds for the Universities of Henan Province under grant no. NSFRF1616, and Key Scientific Research Projects of Universities in Henan grant no. 19A520004.