Sleep disorders are a serious public health problem. An unobtrusive home sleep quality monitoring system can open the way to better screening of sleep disorder-related diseases and health monitoring. In this work, a sleep stage classification algorithm based on a multiscale residual convolutional neural network (MRCNN) is proposed to extract the characteristics of electroencephalogram (EEG) signals acquired by wearable systems and classify sleep stages. EEG signals were analyzed in 30-second epochs, and a 5-class sleep stage classification was output: wake (W), rapid eye movement sleep (REM), and nonrapid eye movement sleep (NREM), which includes the N1, N2, and N3 stages. Good results (accuracy of 92.06% and 91.13%, Cohen’s kappa of 0.7360 and 0.7001) were achieved with 5-fold cross-validation and independent subject cross-validation, respectively, on the European Data Format (EDF) dataset containing 197 whole-night polysomnographic sleep recordings. Compared with several representative deep learning methods, this method can obtain sleep stage information from single-channel EEG signals without specialized feature extraction, which brings it closer to clinical application. Experiments on the CinC2018 dataset also showed that the method performs well on a large dataset and can support screening of sleep disorder-related diseases and health surveillance based on automatic sleep staging.

1. Introduction

Sleep plays an important role in human health [1]. Sleep monitoring has significant implications for medical research and practice [2]. Sleep specialists usually evaluate the quality of sleep by analyzing signals from sensors attached to different parts of the body in accordance with the Rechtschaffen and Kales rules [3] or the American Academy of Sleep Medicine (AASM) sleep scoring manual [4]. In particular, polysomnography (PSG), which records EEG, electrooculography (EOG), electrocardiography (ECG), electromyography (EMG), respiratory effort, leg movement, and blood oxygen saturation over several nights in a sleep laboratory, is considered the gold standard for evaluating the sleep status of subjects [5]. To improve the efficiency of sleep monitoring, several effective sleep staging methods based on EEG, ECG, and EMG signals have been proposed in recent years [6]. However, wearing many sensors during sleep is obtrusive and uncomfortable. Signal acquisition mostly relies on silver/silver chloride electrodes applied with adhesive or conductive paste, and they must be placed carefully in hairy regions of the scalp to minimize movement-related noise. This disturbs the natural sleep of subjects and is not suitable for long-term sleep monitoring in a home environment.

In recent years, noninvasive [7, 8] or noncontact [9, 10] measurement of cardiac, respiratory, and body movement signals, which offers the potential of low cost and easy operation for long-term dynamic sleep monitoring, has gradually gained favour among researchers. However, its performance depends heavily on the quality of signal acquisition and on complex signal processing. EEG recordings play a crucial role in the classification of sleep stages. To address wearer comfort, classification of sleep stages based on single-channel EEG signals has been extensively investigated [11, 12], because, compared with complex PSG devices, the corresponding dual-electrode device is wearable and causes less interference.

Compared with frontal electrodes, EEG recordings from central, occipital, and parietal electrodes can better detect sleep spindles, vertex waveforms, and rhythms [13]. However, channel selection is also important for measurement convenience. The Fp1, Fp2, and Fpz electrodes below the hairline are convenient for wearing and data collection, as are the F3 and F4 electrodes. Researchers have tried to address this problem by extracting the same relevant information from frontal electrodes [14]. However, the information that supports the W, N1, N2, and N3 stages differs from that extracted from the parietal lobe, which can affect the credibility of staging results.

The performance of most sleep stage classification methods relies on the selection of representative features for the different sleep stages. Frequency-domain [15], time-domain, and time-frequency-domain [16] decompositions are the common steps for processing time signals and extracting features directly, and various mathematical models have been established in the process of discovering hidden features. After feature extraction, various machine learning algorithms are used for classification [17, 18], such as linear discriminant analysis, nearest neighbour classifiers, decision trees, support vector machines, random forests, and ensemble learning. Combined machine learning models have also shown good results [19]. In recent years, deep learning methods such as convolutional neural networks (CNNs) [20], recurrent neural networks, and other deep neural networks have become common in pattern recognition for biomedical signal processing. In [21], a long short-term memory (LSTM) model, which takes advantage of sequential data learning to optimize classification performance, was proposed for automated sleep staging. Since feature-based approaches may not be suitable for a comprehensive description of subject heterogeneity, CNNs were also applied to learn multiple filters that extract time-invariant features from a raw EEG channel [22].

To solve both the subject heterogeneity and temporal pattern recognition problems, a combination of CNN and LSTM applied to precomputed spectrograms showed good performance in [23]. However, most representative deep learning models rely heavily on hyperparameter tuning, which makes them difficult to extend. Although some studies such as [24] consider the temporal context, training must be divided into pretraining and fine-tuning. In addition, since the learning rate is set to a very low value during fine-tuning, it takes more time to reach the optimum, or the optimum may not be reached at all. Because the two stages must run one after the other, the computation cannot be performed in parallel, which also prolongs the training process.

Increasing the number of layers of a CNN can improve the extraction of signal features. Multiscale CNNs were also proposed to perform multiscale feature extraction and classification simultaneously [25]. However, vanishing or exploding gradients are likely to occur as the depth of a CNN increases. The residual network (ResNet) proposed by He et al. [26] was designed to solve the degradation problem of deep networks. The method was applied to machine fault detection and achieved good results [27]. Sleep stage classification based on a residual attention model was also adopted in [28], but only $k$-fold cross-validation, not subject cross-validation, was performed, and the amount of test data was insufficient.

Considering that a multiscale convolutional neural network can capture the detailed signal features required for pattern classification, the idea was adopted to realize a wearable smart eye-mask in our prior study [29]. This method uses original single-channel EEG signals and omits the process of specialized feature extraction. It has good performance and application potential and can support clinical applications such as screening and diagnosis of sleep disorders. The main contributions of this work are as follows:

(1) A deep learning architecture based on ResNet with different filter sizes was developed. Time-invariant features can be extracted from original single-channel EEG signals by training learnable filters, which saves the time of feature computation, while the residual can be trained to encode temporal information such as sleep stage transition rules into the model.

(2) A training algorithm that can effectively train the model end-to-end was developed. Single-channel EEG from the forehead was adopted to reduce the burden on patients and increase usability.

(3) The proposed approach was evaluated on two publicly available datasets, sleep-EDF-expanded [30] and CinC2018 [31], through subject cross-validation experiments and showed strong robustness. The results are compared with state-of-the-art results in the field of sleep stage classification to demonstrate the superiority of our method, which also addresses the problem that too little data may lead to poor generalization of a sleep staging model.

2. Methods

Sleep staging is a recognition or regression problem on time series signals. According to the sleep manual, experts score the EEG, EOG, EMG, and other signals obtained by sensors in 30 s intervals as W, REM, N1, N2, or N3. In this paper, multiscale feature learning was integrated into a residual network (ResNet) to automatically learn sleep features from the original physiological signals at different time scales, and classification was carried out in parallel on the complex original sleep signals.

2.1. Residual Block

In general, the more layers a convolutional neural network has, the more diverse the features it can extract. However, previous experiments show that deep networks suffer from a degradation problem: as network depth increases, accuracy saturates and then degrades, and adding more layers can lead to an even higher training error, which is not caused by overfitting.

Although a CNN with dozens of layers can be trained with normalized initialization and batch normalization (BN), it is prone to degradation as the number of layers increases. Theoretically, if an additional layer in a deep network learns nothing and simply copies the features of the previous layer, which is called identity mapping, the training error should not increase. A deep residual learning framework was proposed to solve the degradation phenomenon [26]: if an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity mapping with a stack of nonlinear layers.

For a network with input $x$, the learned feature is denoted as $H(x)$, and the network is expected to learn the residual $F(x) = H(x) - x$. Because residual learning is easier than direct feature learning, it is applied to every few stacked layers. As shown in Figure 1, a residual block is formed by a neural network with shortcut connections. It contains two kinds of mappings: the identity mapping, which is the shortcut curve in the figure, and the residual mapping. If the residual is not equal to 0, network performance can still be improved by increasing the number of layers. If the residual is 0, the current layer is just an identity mapping, which neither improves nor degrades performance; thus, the network degradation problem can be solved.

The output of the shortcut operation is given by

$$y = F(x, \{W_i\}) + x,$$

where $x$ and $y$ are the input and output vectors of the layers under consideration and $F(x, \{W_i\})$ is the residual mapping function to be learned. For a block with two layers, it can be expressed as

$$F = W_2\,\sigma(W_1 x),$$

where $\sigma$ is the rectified linear unit (ReLU) activation function.
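The shortcut computation can be illustrated with a minimal pure-Python sketch. It assumes a single-channel signal, odd-length kernels with zero padding, and no batch normalization; the helper names `conv1d_same`, `relu`, and `residual_block` are illustrative and are not taken from the original PyTorch implementation.

```python
def conv1d_same(x, kernel):
    """1-D cross-correlation with zero padding; output has the same length as x.

    Assumes an odd kernel length so the padding is symmetric.
    """
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(x))
    ]


def relu(x):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in x]


def residual_block(x, w1, w2):
    """Return y = F(x) + x, with F(x) = w2 * relu(w1 * x) and * a convolution."""
    residual = conv1d_same(relu(conv1d_same(x, w1)), w2)
    return [r + v for r, v in zip(residual, x)]
```

When both kernels are zero, the residual branch outputs zero and the block reduces to the identity mapping, which is the behaviour that protects a deeper network from degradation.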

2.2. MRCNN

To extract features from different receptive fields, we used three ResNet pathways built from multiple ResNet units, as shown in Figure 2, and time series signal segments were used directly as inputs of the network. Each pathway contained four ResNet units, each with two convolutional layers and a shortcut. Each convolutional layer was followed by BN and the ReLU activation function. A solid-line shortcut means that the input can be added directly, whereas a dashed-line shortcut indicates that the input must first pass through a convolution (Conv) to match the dimensions. The output of each pathway was average-pooled into 512 features. The ResNet blocks of the three pathways used three different convolution kernel sizes.

According to the above description, the expression of ResNet unit constructed can be expressed by the following equations.

where is convolution operator, is used to improve generalization capability.

CNN is a unique artificial neural network inspired by the cerebral cortex, which uses convolution method to extract signal features and compress signal size. In order to retain valid information, it reduces the amount of input data by several orders of magnitude, which can form an optimal network and reduce the risk of network overfitting. CNN contains two core layers: convolution and pooling. The purpose of convolution is to extract features at different levels from original data, and a certain number of filters are used to extract feature maps of input data. The pooling layer is periodically inserted between CNN’s convolutional layers, and a subsampling operation is used to reduce the number of parameters. At the end of the algorithm, the activation function which can significantly improve the training speed is used.

In a CNN, the receptive field is defined as the size of the region of the input that is mapped to a single point on the output feature map of a given layer. In layman’s terms, it is the area of the input that an output feature point can “see.” The receptive field is calculated by

$$RF_i = (RF_{i+1} - 1) \times s_i + k_i,$$

where $RF_i$ is the receptive field of the $i$-th convolutional layer, $RF_{i+1}$ is the receptive field of the $(i+1)$-th layer, $s_i$ is the convolution step size (stride) of the $i$-th layer, and $k_i$ is the size of the convolution kernel of the $i$-th layer.
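The receptive-field calculation can be sketched as follows, assuming the common top-down recursion RF_i = (RF_{i+1} - 1) * s_i + k_i, started from a single output point. The layer list in the example is hypothetical and does not reproduce the layer configuration of the MRCNN.

```python
def receptive_field(layers):
    """Receptive field (in input samples) of one point in the last layer.

    `layers` is a list of (kernel_size, stride) pairs ordered from the
    input side to the output side of the network.
    """
    rf = 1  # start from a single point on the final feature map
    for kernel_size, stride in reversed(layers):
        rf = (rf - 1) * stride + kernel_size
    return rf


# Hypothetical stack: a length-15 convolution, a stride-2 pooling layer,
# and two length-3 convolutions.
example_layers = [(15, 1), (2, 2), (3, 1), (3, 1)]
example_rf = receptive_field(example_layers)
```

Walking the layers from output to input needs only the kernel size and stride of each layer, which matches the quantities named in the text.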

The EEG signals input to the network were first processed by a convolutional layer with a kernel length of 15, followed by BN, ReLU activation, and max pooling. The output was then sent to three pathways with different convolution kernel sizes. Finally, the features of the three pathways were concatenated and connected to a fully connected layer with 1536 neurons, and the classification results were obtained by the softmax function.
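The way three pathways of 512 pooled features each yield the 1536 inputs of the fully connected layer can be illustrated with a toy sketch. The kernel sizes (3, 5, 7) and the fixed averaging filters are placeholders chosen only for illustration; the real pathways are stacks of trained ResNet units.

```python
def conv1d_valid(x, kernel):
    """1-D cross-correlation without padding."""
    n_out = len(x) - len(kernel) + 1
    return [
        sum(k * x[i + j] for j, k in enumerate(kernel))
        for i in range(n_out)
    ]


def pathway(x, kernel_size, n_features):
    """Toy pathway: n_features fixed filters, ReLU, then global average pooling."""
    features = []
    for f in range(n_features):
        # Placeholder weights; a real pathway learns its filters.
        kernel = [(f + 1) / kernel_size] * kernel_size
        feature_map = [max(0.0, v) for v in conv1d_valid(x, kernel)]
        features.append(sum(feature_map) / len(feature_map))
    return features


def multiscale_features(x, kernel_sizes=(3, 5, 7), n_features=512):
    """Concatenate the pooled features of all pathways into one vector."""
    merged = []
    for kernel_size in kernel_sizes:
        merged.extend(pathway(x, kernel_size, n_features))
    return merged
```

With three pathways and 512 features per pathway, the concatenated vector has 3 × 512 = 1536 entries, matching the width of the fully connected layer.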

The receptive field of each convolutional layer can be calculated according to equation (1). The receptive field of the output feature map of the last convolutional layer in the network covered 563 points of the input data. The EEG signals had a sampling frequency of 100 Hz, and the effective frequency resolution of 563 data points was 0.35 Hz, which meets the frequency resolution requirements of all rhythmic waves.

2.3. Network Training

To compute the loss for the multiclass sleep stage classification, weighted cross entropy (CE) was used as the loss function:

$$\mathrm{CE}(x, c) = -w_c \log\left(x[c]\right),$$

where $x$ is the one-dimensional array of predicted probabilities for each label after Softmax, $c$ is the actual label, $x[c]$ is the element of the array with index $c$, and $w_c$ is the weight of the actual label. Since sleep stage classification is an imbalanced classification task, the weight $w_c$ was set from $p_c$, the proportion of label $c$ among all labels, to balance the differences in the amount of data for each label.
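A class-weighted cross entropy of this kind can be sketched in plain Python. The weighting rule used here (weights inversely proportional to the label proportion, rescaled to mean 1) is an assumption made for illustration, because the exact weight formula is not legible in the text; the function names are likewise hypothetical.

```python
import math


def class_weights(labels, n_classes):
    """Assumed scheme: weights inversely proportional to label proportion."""
    counts = [0] * n_classes
    for y in labels:
        counts[y] += 1
    total = len(labels)
    raw = [total / c if c > 0 else 0.0 for c in counts]
    scale = n_classes / sum(raw)  # rescale so the weights average to 1
    return [r * scale for r in raw]


def weighted_cross_entropy(probs, label, weights):
    """Loss for one epoch: -w[label] * log(probs[label]).

    `probs` is the softmax output, i.e. one probability per sleep stage.
    """
    return -weights[label] * math.log(probs[label])
```

Under this scheme a rare stage such as N1 receives a larger weight than a frequent stage, so its misclassification contributes more to the loss.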

For the loss function, the adaptive moment estimation solver (Adam) was used for optimization. Learning rate was set to , and all other parameters were set to default. Each time the network training was performed, the algorithm would be used to optimize all training set data for 20 times, and then the verification set would be used to obtain the system performance index results. The MRCNN was built by using PyTorch 1.0, and it was trained by using GTX950M with Ubuntu 18.04 system. Other hardware configurations include Intel Core i7-4710MQ, 12 GB RAM.

2.4. Performance Evaluation

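Recall ($Re$), accuracy ($Acc$), and specificity ($Sp$) were used to evaluate the results of sleep staging. Overall recall, overall accuracy, and overall specificity were expressed by the following equations:

$$Re = \frac{1}{5}\sum_{i=1}^{5}\frac{TP_i}{TP_i + FN_i},\qquad Acc = \frac{1}{5}\sum_{i=1}^{5}\frac{TP_i + TN_i}{TP_i + TN_i + FP_i + FN_i},\qquad Sp = \frac{1}{5}\sum_{i=1}^{5}\frac{TN_i}{TN_i + FP_i},$$

where $TP_i$, $TN_i$, $FP_i$, and $FN_i$ represent the true positive cases, true negative cases, false positive cases, and false negative cases produced by the classifier when judging category $i$, respectively. In these equations, $i$ indexes the five different stages of sleep. At the same time, the kappa coefficient, which describes the overall performance of the system, was calculated as

$$\kappa = \frac{p_o - p_e}{1 - p_e},\qquad p_o = \frac{1}{n}\sum_{j=1}^{n}\mathbb{1}\left(y_j = \hat{y}_j\right),\qquad p_e = \frac{\sum_{i} a_i b_i}{n^2},$$

where $y_j$ represents the real label of sample $j$, $\hat{y}_j$ represents the label of sample $j$ predicted by the model, $p_o$ is the sum of the correctly classified samples of each category divided by the total number of samples, $a_i$ is the actual number of samples of category $i$, $b_i$ is the predicted number of samples of category $i$, and $n$ is the total number of samples.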
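These metrics can be computed from a confusion matrix as in the sketch below. It assumes that each overall value is the unweighted mean of the per-stage values and that every stage occurs at least once; the function names are illustrative.

```python
def per_class_metrics(cm):
    """Per-stage recall, accuracy, and specificity (one-vs-rest).

    cm[i][j] is the number of epochs of true stage i predicted as stage j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    metrics = []
    for i in range(n):
        tp = cm[i][i]
        fn = sum(cm[i]) - tp
        fp = sum(cm[r][i] for r in range(n)) - tp
        tn = total - tp - fn - fp
        metrics.append({
            "recall": tp / (tp + fn),
            "accuracy": (tp + tn) / total,
            "specificity": tn / (tn + fp),
        })
    return metrics


def overall(metrics, key):
    """Unweighted mean of one metric over all stages (assumed definition)."""
    return sum(m[key] for m in metrics) / len(metrics)


def cohen_kappa(cm):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    p_o = sum(cm[i][i] for i in range(n)) / total
    p_e = sum(
        sum(cm[i]) * sum(cm[r][i] for r in range(n)) for i in range(n)
    ) / total ** 2
    return (p_o - p_e) / (1 - p_e)
```

Because the per-stage accuracy counts true negatives, it can be considerably higher than the fraction of epochs assigned to the correct stage, which is why the kappa coefficient is reported alongside it.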

3. Experiments

3.1. Datasets

To ensure the robustness and reproducibility of the results, experiments were conducted on two public datasets. The first dataset was sleep-EDF-expanded, which contains two groups of subjects, the Sleep Cassette (SC) group and the Sleep Telemetry (ST) group. The annotation files, manually scored by a skilled technician, included sleep stages W, REM, Stage 1 (S1), Stage 2 (S2), Stage 3 (S3), Stage 4 (S4), movement time (M), and UNKNOWN. Stage M and UNKNOWN were removed because of their extremely small percentage. According to the latest AASM sleep scoring manual, S1 and S2 were mapped to N1 and N2, respectively, and S3 and S4 were combined into N3. In this study, we cropped the SC files in the dataset so that only signals from 30 minutes before the beginning of sleep to 30 minutes after the end of sleep were retained, and data from the FPz-Cz channel were used.
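The relabeling step can be written as a small lookup, sketched below. The label strings follow the abbreviations used in the text rather than the raw annotation strings of the dataset files, and the names `RK_TO_AASM`, `DROPPED`, and `relabel` are hypothetical.

```python
# Rechtschaffen and Kales stages mapped to AASM stages, as described in the text.
RK_TO_AASM = {
    "W": "W",
    "REM": "REM",
    "S1": "N1",
    "S2": "N2",
    "S3": "N3",
    "S4": "N3",
}

# Stages removed because of their very small share of the data.
DROPPED = {"M", "UNKNOWN"}


def relabel(epochs):
    """Convert (signal, rk_label) pairs to (signal, aasm_label) pairs.

    Epochs scored as movement time or UNKNOWN are discarded.
    """
    converted = []
    for signal, label in epochs:
        if label in DROPPED:
            continue
        converted.append((signal, RK_TO_AASM[label]))
    return converted
```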

The second dataset was provided by CinC2018 and contains 1,985 samples. The sleep stages of each sample were labeled by Massachusetts General Hospital clinical staff and divided into six stages: W, N1, N2, N3, REM, and Undefined. For research and application considerations, data from the F4-M1 channel were selected here. The data were divided into a training set and a test set. We randomly selected 500 subjects from the training set as our dataset and deleted the Undefined periods, which accounted for a small proportion. To conform to the AASM sleep scoring manual, the EEG signals of both datasets were divided into 30 s epochs. After processing, the sleep stage statistics of the two datasets are shown in Table 1.
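Splitting a continuous recording into 30 s epochs can be sketched as below. The default sampling rate of 100 Hz is the value stated for the EEG signals in Section 2.2; dropping a trailing partial epoch is an assumption, and `split_into_epochs` is an illustrative name.

```python
def split_into_epochs(signal, fs=100, epoch_seconds=30):
    """Cut a 1-D signal into consecutive, non-overlapping epochs.

    A trailing segment shorter than one epoch is dropped (assumed behaviour).
    """
    samples_per_epoch = fs * epoch_seconds
    n_epochs = len(signal) // samples_per_epoch
    return [
        signal[i * samples_per_epoch:(i + 1) * samples_per_epoch]
        for i in range(n_epochs)
    ]
```

At 100 Hz each epoch contains 3000 data points, which is the epoch length referred to in the preprocessing step.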

3.2. Preprocessing

In order to adapt to different PSG devices and individual differences of subjects, EEG signals were normalized by using the 5-th and 95-th quantiles [32], as shown in the following equation. where is the result of signal normalization, is original signal, and and are the 5% and 95% largest value of the signals.
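A quantile-based normalization can be sketched as follows. The specific form used here, (x - Q5) / (Q95 - Q5), is an assumption, since the normalization equation itself is not legible in the text; the percentile helper uses linear interpolation, and both function names are hypothetical.

```python
def percentile(sorted_values, q):
    """Percentile (q in [0, 100]) of a sorted list, with linear interpolation."""
    position = (len(sorted_values) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] * (1 - fraction) + sorted_values[upper] * fraction


def quantile_normalize(signal):
    """Scale a signal by its 5th and 95th percentiles (assumed formula)."""
    ordered = sorted(signal)
    q5 = percentile(ordered, 5)
    q95 = percentile(ordered, 95)
    scale = q95 - q5
    if scale == 0:
        raise ValueError("signal is constant; cannot normalize")
    return [(v - q5) / scale for v in signal]
```

Using percentiles instead of the minimum and maximum keeps isolated artifacts, such as movement spikes, from dominating the scale.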

To expand the dataset and improve the generalization ability of the network, each input was augmented: the 3000 data points of each epoch were randomly cropped to 2700, the signal was flipped with a probability of 50%, and random noise scaled by 0.01 was added. Finally, the preprocessed data were assembled with batch and channel dimensions and converted to a tensor of shape $(N, C, L)$, where $N$, $C$, and $L$ represent the batch size, channel number, and number of data points in a single epoch, respectively. To feed the sleep stage corresponding to the EEG of each epoch into the network, the sleep stages were mapped to numeric class labels.
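The augmentation steps can be sketched with the standard library as follows. Two details are assumptions: the flip is implemented as a time reversal, and the noise is drawn from a standard normal distribution before scaling by 0.01. The names `augment` and `to_batch` are illustrative.

```python
import random


def augment(epoch, crop_len=2700, flip_prob=0.5, noise_scale=0.01, rng=random):
    """Random crop, random flip, and additive noise for one epoch."""
    start = rng.randint(0, len(epoch) - crop_len)  # both bounds inclusive
    cropped = epoch[start:start + crop_len]
    if rng.random() < flip_prob:
        cropped = cropped[::-1]  # assumed: flip means time reversal
    # Assumed: Gaussian noise with unit variance, scaled by noise_scale.
    return [v + noise_scale * rng.gauss(0.0, 1.0) for v in cropped]


def to_batch(epochs):
    """Nested list of shape (batch, channel=1, points)."""
    return [[epoch] for epoch in epochs]
```

Passing a seeded `random.Random` instance as `rng` makes the augmentation reproducible, for example `augment(epoch, rng=random.Random(0))`.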

3.3. Performance Evaluation

The cross-validation used in this paper included $k$-fold cross-validation and subject cross-validation. The former randomly divided the entire dataset into $k$ subsets with the epoch as the smallest unit. Each subset was taken in turn as the verification set, and the remaining $k-1$ subsets were taken as the training set. Experiments were performed $k$ times, and the results of all verification sets were weighted and summarized to get the final result.

In order to evaluate the performance of proposed method, 5-fold cross-validation and subject cross-validation were completed on sleep-EDFx dataset, and data of 197 subjects from FPz-Cz channels was used. Subject cross-validation divided the dataset into training set and verification set with a partition ratio of 8 : 2, and statistical information is shown in Tables 2 and 3. In order to evaluate the effect in practical application, subject cross-validation was also completed on CinC2018 dataset, and the data of 500 subjects from F4-M1 channel was used.
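The difference between the two validation schemes can be sketched as follows: the epoch-level $k$-fold split may place epochs of one subject in both sets, whereas the subject split keeps all epochs of a subject together. The function names, the fixed seed, and the interleaved fold assignment are illustrative choices, not the authors' code.

```python
import random


def kfold_epoch_splits(n_epochs, k=5, seed=0):
    """Yield (train_indices, val_indices) for epoch-level k-fold validation."""
    indices = list(range(n_epochs))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, val


def subject_split(subject_ids, val_ratio=0.2, seed=0):
    """Split epoch indices so that no subject appears in both sets.

    `subject_ids[i]` is the subject that epoch i belongs to.
    """
    subjects = sorted(set(subject_ids))
    random.Random(seed).shuffle(subjects)
    n_val = max(1, round(len(subjects) * val_ratio))
    val_subjects = set(subjects[:n_val])
    train = [i for i, s in enumerate(subject_ids) if s not in val_subjects]
    val = [i for i, s in enumerate(subject_ids) if s in val_subjects]
    return train, val
```

With `val_ratio=0.2` the subject split reproduces the 8 : 2 partition at the subject level, so the verification set measures performance on people the model has never seen.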

4. Results and Discussion

4.1. Cross-Validation

The results of 5-fold cross-validation are shown in Table 4. The table contains the classification performance indices of each sleep stage and the overall performance indices derived from the confusion matrix obtained after cross-validation. The overall recall, accuracy, specificity, and kappa coefficient are 78.94%, 92.06%, 95.13%, and 0.7360, respectively. The proposed network provides good classification performance with high resolution for stage W but poor resolution for N1. In particular, the confusion matrix shows that N1 is easily misjudged as N2 or REM, which is consistent with the results in [33]. This may be due to the relatively small proportion of sleep time spent in N1, which results in less training data. In addition, N1 is a transitional period between the waking state and the sleep state and contains both $\alpha$ and $\theta$ waves, which makes classification difficult.

The results of subject cross-validation on the dataset are shown in Table 5. The overall recall, accuracy, specificity, and kappa coefficient are 75.81%, 91.13%, 94.65%, and 0.7001, respectively. A comparison with existing sleep staging methods is shown in Table 6. The method maintains good performance on data from unseen subjects when a large amount of data is available and reaches a level similar to that of CNN-LSTM [21]. This cross-validation method effectively demonstrates the generalization ability of the system on unknown subjects and has high practical value. In terms of N1 resolution, the recall of the proposed network for N1 is 60%, which is considerably higher than that of other deep learning models.

In [29], several methods based on residual networks were compared on the sleep-EDF-expanded dataset, and the proposed MRCNN performed better than the other residual networks, including in terms of N1 resolution. In this paper, ResNet18 [26] and MRCNN were compared under the same conditions. To adapt ResNet18 to one-dimensional input data, all two-dimensional layers in its network structure were replaced with one-dimensional layers. The proposed MRCNN again performs better than ResNet18 in terms of N1 resolution. From the data in Table 6, it can also be seen that the proposed network provides poor resolution for N2. In particular, the confusion matrix in Table 5 shows that N2 was easily misjudged as N1 or N3. This may be because N2 is a transitional period between N1 and N3 and contains both sleep spindles and K-complex waves, which makes classification difficult.

4.2. Performance Evaluation in Wearable Application

The channel used in these experiments is F4-M1, which is suitable for wearable applications. A training set of 400 samples and a test set of 100 samples from CinC2018 were used for subject cross-validation. The results are shown in Table 7. Although the performance of the system decreased slightly when it was applied to a large amount of data, it still demonstrated the generalization ability of the system on unknown subjects.

4.3. Comparison of Automatic and Artificial Sleep Staging

Automated sleep staging was performed with the trained model and compared with the manual scoring results of an expert. As shown in Figure 3, the automatic staging results are close to the manual staging results.

4.4. Automatic Extraction of EEG Features by MRCNN

The results of MRCNN’s extraction of effective EEG features are shown here. EEG data in W stage and N3 stage from CinC2018 dataset were fed into the trained network model, respectively, and then the output of the first convolution layer of the model was visualized. It contains 64 convolution kernels, corresponding to the output of 64 channels, and the final results are shown in Figure 4. It can be seen from the figure that for the same input signal, different convolution kernels have different outputs. In addition, by comparing the convolution layer output results of W stage and N3 stage, it can be found that there are significant differences in the output of the same convolution kernel of different input signals, which fully demonstrates that the convolutional neural network can automatically extract the features of EEG signals.

5. Conclusions

In this paper, a new sleep staging method based on a multiscale residual network was proposed. It can automatically extract useful information from original single-channel EEG signals and classify sleep stages. Cross-validation on the datasets shows that system performance is maintained on data from unseen subjects when the data volume is large. Compared with other deep learning methods, our method uses only a single-channel EEG and does not require complex data preprocessing or specialized feature extraction to achieve better system performance, which opens the possibility of clinical application of automated sleep staging. In addition, the multiscale residual network could be deepened further given sufficient computing capacity, and a larger amount of data could then be used for training, so that, in theory, a model with better robustness and system performance could be obtained.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this paper.


The research was supported by the Science and Technology Program of Guangzhou (2019050001), Program for Chang Jiang Scholars and Innovative Research Teams in Universities (no. IRT_17R40), Guangdong Provincial Key Laboratory of Optical Information Materials and Technology (2017B030301007), Guangzhou Key Laboratory of Electronic Paper Displays Materials and Devices (201705030007), and MOE International Laboratory for Optical Information Technologies and the 111 Project. “A multiscale residual convolutional neural network for sleep staging based on single channel electroencephalography signals” as the preliminary study has been presented as a preprint according to the following link: https://www.researchsquare.com/article/rs-554671/v1.