Abstract

A novel multichannel dilated convolutional neural network for improving the accuracy of human activity recognition is proposed. The proposed model uses a multichannel convolution structure with kernels of various sizes to extract multiscale features from high-dimensional human activity data during the convolution operation, and it replaces the pooling layers of traditional convolution with dilated convolution. The advantage is twofold: the dilated convolution captures intrinsic sequence information by expanding the receptive field of the convolution kernel without increasing the number of model parameters, and the multichannel structure extracts multiscale gait features by forming multiple convolution paths. An open human activity recognition dataset is used to evaluate the effectiveness of the proposed model. The experimental results show that our model achieves an accuracy of 95.49%, with the time to identify a single sample being approximately 0.34 ms on a low-end machine. These results demonstrate that our model is an efficient real-time HAR model that can extract representative features from sensor signals at low computational cost and is a promising tool for practical applications.

1. Introduction

Human activity recognition (HAR) is a typical multiclass classification problem, in which human activity-related data is acquired and analyzed to identify human activity status [1, 2]. It plays an essential role in people’s daily life and is widely used in the fields of safety, medical care, smart home, and entertainment. Specific applications include smart home [3, 4], gait analysis [5, 6], security certification [7, 8], health monitoring [9, 10], athlete monitoring [11], and gesture recognition [12, 13]. There are two main approaches to human activity recognition: vision-based and sensor-based. Although the vision-based approach has been extensively studied and can achieve a high recognition rate, it is limited by the high acquisition cost of imaging devices, and collecting image data is sometimes challenging, so it is hard to meet the needs of real-life environments. With the development of smartphones and wearable sensor technologies, smart devices with built-in sensors are characterized by low cost, convenient carrying, and good real-time performance. Therefore, HAR based on sensor signals has become the focus of research in this field.

HAR based on sensor signals includes two kinds of methods: traditional methods and deep learning methods. Traditional sensor-based HAR methods require complex preprocessing of the raw data and rely on manual experience to extract the required time-domain features [14–16], frequency-domain features [16–19], and other features [20, 21]. These hand-crafted features are shallow features, which inevitably lose some implicit key features. Deep learning methods can make up for these shortcomings and can automatically mine the more discriminative inherent features contained in the data by learning a deep nonlinear network structure.

Deep learning is well known as a revolution in machine learning, especially in the fields of computer vision [22, 23] and natural language processing [24, 25]. In recent years, different deep learning methods have been proposed for human activity recognition based on sensor signals, including autoencoders [26], fully connected deep neural networks (DNN) [27, 28], recurrent neural networks (RNN), convolutional neural networks (CNN), and hybrid deep learning models. RNN, CNN, and hybrid models are the most widely studied in HAR, and we introduce them in detail in the second section. RNN, especially Long Short-Term Memory (LSTM), can capture the dynamic time dependence of various motions and helps to explore pattern features [2]. However, LSTM takes a long training time due to the numerous parameters that must be updated during training. Compared with RNN, CNN is better able to learn the crucial features contained in recursive patterns [1, 29]. However, most CNNs use a single parameter setting in the convolution process, which dramatically limits the flexibility of the model. Besides, a larger convolution kernel helps capture more information but increases the computational cost of CNN. Dilated convolution may be an effective solution: it dilates the receptive field of the convolution kernel without increasing the number of kernel parameters [23].

Real-time HAR is also a research hotspot in this field, and several methods have been proposed to address it [30, 31]. The shortcoming of these works is that it is difficult to maintain a balance between recognition accuracy and running time. These challenges have led researchers to seek recognition methods that combine high recognition accuracy with low computational complexity.

Based on these deficiencies in current research, this paper proposes a novel multichannel dilated convolutional neural network (MDCNN). The model obtains a larger receptive field to extract global features of long time series from the raw sensor data by using dilated convolution instead of the traditional convolution structure. Moreover, the proposed model uses multichannel block convolution with different kernel sizes to obtain combined multiscale features. Experimental comparison shows that the proposed model effectively improves recognition accuracy and achieves real-time HAR.

The rest of this paper is organized as follows: Section 2 provides related work concerning different deep learning methods for HAR. Section 3 describes the fundamentals of CNN, dilated convolution, and multichannel convolution. The framework and training process of the proposed model are introduced in Section 4. Section 5 conducts a series of experiments with the proposed model and discusses the results, while Section 6 gives the conclusion and presents our future work.

2. Related Work

In recent years, various deep learning methods have been proposed for sensor-based HAR. RNN can retain memory and learn from sequence data to capture the inherent relationships of time series. Chen et al. proposed an LSTM-based method that uses three-axis accelerometer data from the public WISDM dataset to identify human activities with an accuracy of 92.1% [32]. Guan and Plötz developed ensembles of deep LSTM learners, which combine sets of diverse LSTM learners into classifier collectives [33]. Experimental results on three standard benchmarks (Opportunity, PAMAP2, and Skoda) demonstrate that the ensembles outperform individual LSTM networks. However, deep LSTM takes a long training time due to the numerous parameters that must be updated during training. To enable faster learning in early training, Zhao et al. proposed an improved LSTM model, Res-Bidir-LSTM, which also guarantees the validity of information transmission through residual connections and bidirectional cells [34]. Their results show an accuracy improvement of around 4% on the public UCI and Opportunity datasets compared with previous work. Hammerla et al. explored three types of deep learning models (deep feed-forward networks (DNN), CNN, and RNN) on three laboratory benchmark datasets [35] and found that CNN performs better than the other models on prolonged activities such as walking and running.

There are two advantages to CNN: local dependence and scale invariance [1, 36]. Local dependence means that the signal at the current time may be related to the signals around that point, and scale invariance refers to the fact that the research object does not change in amplitude or frequency across scales [37]. Zeng et al. proposed an original CNN model for accelerometer data, in which each acceleration axis is input to a separate convolution layer and pooling layer to extract features [36]. However, because the model considers only the acceleration data and its structure is too simple, it struggles to extract crucial features.

Ronao and Cho constructed a three-layer CNN that automatically extracts robust features from the raw data to raise the accuracy of HAR, evaluated on the UCI and WISDM datasets [38]. They further improved the performance of their model by using additional information from the fast Fourier transform (FFT) of the raw data. However, both the convolution and the pooling operations of this method are performed in a single channel; this single parameter setting dramatically limits flexibility, so the network cannot effectively extract global and local features at multiple scales. Mohammad et al. presented multiple CNN pipelines with a late-fusion structure and bypassing connections [39]. This model can comfortably accommodate multiple sensors and signal representations, such as time-domain data, FFT information, and spectrograms, achieving higher performance on six publicly available datasets. However, it is computationally expensive compared with earlier methods due to the bypassing connections from all layers [39].

Besides single models, hybrid deep learning models combining CNN and RNN have also been proposed in a few works. Ordóñez and Roggen proposed a generic deep framework for activity recognition based on convolutional and LSTM recurrent units [40], in which the CNN acts as a feature extractor and the LSTM models the temporal dynamics of the extracted feature maps. However, this complex network framework suffers from low efficiency and can hardly meet real-time requirements in practical applications.

Most current research is carried out offline, though some works have realized HAR in real time. Inoue et al. proposed a deep recurrent neural network (DRNN) for HAR with a high recognition rate and high throughput [30]. However, despite reducing the training time through parallel processing on a GPU, the network remained very large [41]. Cao et al. proposed a Group-based Context-aware human activity recognition (GCHAR) method to achieve real-time HAR, which uses a hierarchical group-based classification scheme and context awareness to enhance classification performance [31]. Its training and testing times are shorter than those of the comparison algorithms, but its classification accuracy of 94.16% is slightly lower than that of deep learning algorithms. Therefore, the core of the current work is to achieve a model with high recognition accuracy and low computational complexity.

3. Methodology

3.1. Convolutional Neural Network for HAR

CNN is a multilayered deep network structure consisting of the input layer, convolutional layers, pooling layers, fully connected layers, and the output layer. Among these, the alternating convolutional and pooling layers constitute the most prominent structure. Various studies in computer vision have shown that a multilayer CNN structure of convolutional and pooling layers can extract image features at different levels. At the bottom of the CNN, it generally learns basic features such as local textures and lines of the image. As the network deepens, the model learns increasingly complex features, and its recognition ability rises from identifying the contours of an object to identifying the entire image.

In CNN, the convolution and pooling operations are performed in sequence: the output of a convolution operation serves as the input of a pooling operation, the pooled result serves as the input of the next convolution layer, and so on, until the result is finally fed into the Softmax layer.

Considering that the sensor data is a one-dimensional time series, the input of the proposed multichannel dilated convolution network is one-dimensional time-series data, so its convolution kernel adopts a one-dimensional structure. The output of each convolutional layer and pooling layer is correspondingly a one-dimensional feature vector. The accelerometer and gyroscope time-series inputs are expressed as

$$x = [x_1, x_2, \ldots, x_N],$$

where $N$ denotes the length of the time window.
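To make this input layout concrete, the following minimal NumPy sketch shows the tensor shape fed to the network; the window length $N = 128$ and the nine sensor channels are assumptions taken from the UCI HAR dataset described in Section 5:

```python
import numpy as np

# A minimal sketch of the input representation, assuming the UCI HAR setup
# used in Section 5: windows of N = 128 readings (2.56 s at 50 Hz) over
# 9 inertial channels (body acc, total acc, gyroscope; 3 axes each).
N = 128          # time window length
CHANNELS = 9     # sensor channels stacked along the last axis

x_batch = np.random.randn(32, N, CHANNELS).astype("float32")  # one mini-batch
print(x_batch.shape)  # (32, 128, 9): (samples, window length, channels)
```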

In the convolutional layer, CNN applies convolution kernels to the input data. Each convolutional layer is connected to the data in the local receptive field of the previous layer to extract local features within that receptive field. Each convolution kernel extracts a distinct feature; the data obtained from the convolution operation of one kernel forms a feature map, so multiple kernels yield multiple feature maps and thus multiple features. The output of the convolutional layer is

$$x_{ij}^{l} = \sigma\!\left(b_j + \sum_{m=1}^{M} w_j^{m}\, x_{i+m-1}^{\,l-1}\right),$$

where $\sigma$ is the activation function. The Rectified Linear Unit (ReLU) function [42] is widely used in deep learning as the nonlinear transformation to improve the performance of a deep neural network. $b_j$ is the bias term for the $j$th feature map, $M$ is the kernel size, and $w_j^{m}$ is the $m$th weight of the kernel for feature map $j$.
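The following NumPy sketch illustrates the convolution formula above for a single input signal and a single kernel; the signal, kernel, and bias values are invented for illustration:

```python
import numpy as np

def conv1d_relu(x, w, b):
    """Valid 1D convolution of one signal with one kernel, then ReLU,
    mirroring x_ij = sigma(b_j + sum_m w_j^m x_{i+m-1})."""
    M = len(w)                            # kernel size M
    out = np.empty(len(x) - M + 1)
    for i in range(len(out)):
        out[i] = b + np.dot(w, x[i : i + M])  # weighted sum over receptive field
    return np.maximum(out, 0.0)           # ReLU activation

x = np.array([0.1, 0.5, -0.2, 0.8, 0.3, -0.4])  # a short input signal
w = np.array([0.2, -0.1, 0.4])                  # one convolution kernel
print(conv1d_relu(x, w, b=0.05))                # feature map of length N - M + 1
```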

In the pooling layer, CNN aggregates the local features of a particular region to obtain scale-invariant features. The pooling operation reduces the dimension of the processed data and the computational cost while retaining useful information. The pooling operation used in this paper, max pooling, outputs the maximum value among a set of nearby inputs:

$$p_{ij}^{l} = \max_{1 \le q \le R}\, x_{(i \times s + q)j}^{l},$$

where $R$ is the pooling size and $s$ is the pooling stride. With the stacking of convolutional and pooling layers, this sparse connection method significantly reduces the number of parameters while extracting the deep features of the input data layer by layer. The resulting multichannel feature maps are first converted into a one-dimensional vector $v = [v_1, v_2, \ldots, v_n]$, where $n$ is the number of units in the last pooling layer, and then input into the Softmax layer. The number of Softmax layer neurons equals the number of activity categories. The Softmax layer produces the probability distribution over activity types, and the activity identified by the model is the type with the highest probability:

$$P(c \mid v) = \frac{\exp\left(v^{\top} w_c + b_c\right)}{\sum_{k=1}^{C} \exp\left(v^{\top} w_k + b_k\right)},$$

where $c$ is the activity class and $C$ is the total number of activity classes. Forward propagation through the above process yields the error values of the network.
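As an illustration of the pooling and Softmax formulas above, here is a minimal NumPy sketch; the feature-map and logit values are invented for the example:

```python
import numpy as np

def max_pool1d(x, size, stride):
    """Max pooling: the maximum over each window of `size` inputs, step `stride`."""
    return np.array([x[i : i + size].max()
                     for i in range(0, len(x) - size + 1, stride)])

def softmax(z):
    """Probability distribution over activity classes."""
    e = np.exp(z - z.max())        # shift for numerical stability
    return e / e.sum()

feature_map = np.array([0.3, 0.9, 0.1, 0.7, 0.5, 0.2])
print(max_pool1d(feature_map, size=2, stride=2))  # [0.9 0.7 0.5]

logits = np.array([1.2, 0.3, -0.5, 2.0, 0.0, -1.0])  # one unit per activity class
probs = softmax(logits)
print(probs.argmax())  # predicted class = highest probability (here: class 3)
```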

Batch Normalization (BN) was proposed to improve the performance of CNN [43]. The BN layer improves the data distribution during training and speeds up model training. The BN layer also improves network generalization, helping to avoid overfitting and gradient vanishing during training [44]. Define the input of a hidden layer of the network over a mini-batch as $B = \{x_1, x_2, \ldots, x_m\}$, where $m$ is the number of samples in the batch. First, compute the mean value and variance:

$$\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i, \qquad \sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} \left(x_i - \mu_B\right)^2.$$

Then, each dimension is normalized to $\hat{x}_i$, whose distribution has an expected value of 0 and a variance of 1:

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}},$$

where $\varepsilon$ is a small positive constant close to zero. Finally, a pair of parameters $\gamma$ and $\beta$ are introduced to scale and shift the normalized data; the output $y_i$ of the BN layer is

$$y_i = \gamma \hat{x}_i + \beta.$$

The parameters $\gamma$ and $\beta$ are learned along with the original model parameters.
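The following NumPy sketch illustrates the BN computation above on a synthetic mini-batch; in a real network, $\gamma$ and $\beta$ would be learned rather than fixed:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch Normalization over a mini-batch, per the formulas above."""
    mu = x.mean(axis=0)                      # batch mean
    var = x.var(axis=0)                      # batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize: zero mean, unit variance
    return gamma * x_hat + beta              # learned scale and shift

batch = np.random.randn(32, 64) * 3.0 + 5.0  # 32 samples, 64 features
y = batch_norm(batch, gamma=1.0, beta=0.0)
print(y.mean(), y.std())                     # approximately 0 and 1
```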

In general, CNN derives robust features that are invariant to translation, rotation, and scale from the raw data. This results from the convolution operations of a multikernel network structure, which extracts the features contained in the data; the extracted features become more abstract as the number of network layers increases. Also, due to sparse connections and weight sharing, CNN reduces the number of parameters in model training and helps avoid overfitting [37].

3.2. Dilated Convolution

In the traditional CNN, the pooling operation gives the convolution kernel a larger receptive field, but pooling is not actually a mandatory component of CNN [40]. Meanwhile, excessive pooling tends to cause a large amount of information loss [23]. Dilated convolution can expand the receptive field without pooling, allowing each convolution output to cover a wide range of input information; it has been applied to problems requiring long sequence dependencies, such as speech and text. The inertial sensor signal is a typical time series, so we apply dilated convolution to the human activity recognition model in this paper.

The principle of dilated convolution is to insert fixed zero elements, which are not updated during learning, between the elements of the original convolution kernel, thereby dilating the receptive field of the kernel without increasing the number of kernel parameters [23]. The dilated convolution operation is a variant of the traditional convolution operation. If we denote $r$ as the dilation factor, the one-dimensional dilated convolution is

$$y[i] = \sum_{m=1}^{M} x[i + r \cdot m]\, w[m],$$

where $x$ and $y$ denote the input and output signals, respectively, $w$ is the convolution kernel, $M$ denotes the size of the convolution kernel, and $r$ denotes the dilation rate. One-dimensional dilated convolution is achieved by inserting zeros between the elements of the convolution kernel. For a convolution kernel of size $M$ with dilation factor $r$, the effective size $M'$ can be defined as

$$M' = M + (M - 1)(r - 1).$$

For example, a kernel of size $M = 3$ with $r = 2$ has an effective size of $M' = 5$.

The convolution kernel transformed by the dilation factor is shown in Figure 1.

As can be seen from Figure 1, the original convolution kernel becomes a dilated convolution kernel after the dilation operation with the corresponding dilation factor.

The function of the convolution kernel is to identify certain features in the sensor time series. When a segment of the time series matches the feature that the convolution kernel recognizes, the dilated convolution formula above yields a large activation value in the new feature map for that segment, finally achieving recognition of the time-series features. Figure 1 reveals the change in the receptive field of the convolution kernel after the dilation operation.

Figure 2 shows an example of dilated convolution with a three-layer convolution structure. In the third convolutional layer, the traditional CNN can capture only three inputs before and after a point in the sensor time series, whereas under the same conditions the dilated CNN captures seven, with no change in parameter count compared with the traditional CNN.

Without reducing the resolution of the feature map through pooling, dilated convolution can learn deeper essential features, effectively avoiding the severe loss of local detail information in the sensor data. Furthermore, the convolution layers use different dilation factors to obtain receptive fields of various sizes and thereby extract multiscale activity features.
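As a minimal sketch of this idea in Keras (the framework used in Section 5), the snippet below contrasts an ordinary and a dilated 1D convolution; the filter count and layer sizes are illustrative assumptions:

```python
from tensorflow.keras import layers, models

inputs = layers.Input(shape=(128, 9))   # (window length, sensor channels)

# Ordinary convolution: a kernel of size 3 sees 3 consecutive time steps.
plain = layers.Conv1D(32, kernel_size=3, padding="same")(inputs)

# Dilated convolution: the same 3 weights per kernel, but dilation_rate=2
# spreads them over 5 time steps, widening the receptive field at no
# additional parameter cost.
dilated = layers.Conv1D(32, kernel_size=3, dilation_rate=2,
                        padding="same")(inputs)

m1 = models.Model(inputs, plain)
m2 = models.Model(inputs, dilated)
print(m1.count_params() == m2.count_params())  # True: identical parameter count
```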

3.3. Multichannel Block Convolution Network Structure

Although traditional CNNs use multiple filters to capture different features of an instance [25], they perform convolution in a single channel, which greatly limits the flexibility of parameter settings and prevents effective extraction of global and local features at multiple scales. To enhance the robustness of the model, a CNN can adopt group convolution, that is, a multichannel structure in which each channel uses a different convolution kernel size, corresponding to a different scale of features extracted from the original sensor time series. It can therefore be seen as a fusion method for multiscale features. Figure 3 shows the diagram of multichannel convolution.

In a multichannel CNN, the convolution operations are grouped into multiple branches and carried out separately, and the fully connected layer then concatenates the branches' feature maps along the channel dimension. By using different kernels, large-scale convolution kernels learn features with more global character, while small-scale kernels capture features that better reflect local characteristics.
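A minimal Keras sketch of such a multichannel block follows; the kernel sizes 3, 5, and 7 and the filter count are illustrative assumptions, not the paper's exact settings:

```python
from tensorflow.keras import layers

# A sketch of multichannel (grouped) convolution: parallel branches with
# different kernel sizes, fused by concatenation.
def multichannel_block(inputs):
    branches = []
    for k in (3, 5, 7):                      # one branch per kernel scale
        b = layers.Conv1D(32, kernel_size=k, padding="same",
                          activation="relu")(inputs)
        branches.append(layers.Flatten()(b))
    return layers.Concatenate()(branches)    # splice the multiscale feature vectors
```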

4. Principle of Multichannel Dilated Convolution Model

4.1. Model Overview

The MDCNN model proposed in this paper is shown in Figure 4. The whole model comprises two parts: feature extraction and classification. The feature extraction part consists of three dilated convolution channels, a Flatten layer, and a Concat layer, with the dilated convolution channels forming the core of MDCNN. First, the sensor data is sent to the three mutually independent dilated convolution channels to extract features at different scales. Then, the Flatten layer flattens all dimensions except the time dimension into a one-dimensional feature vector, and the Concat layer concatenates the one-dimensional feature vectors of the dilated convolution channels for feature splicing. Finally, in the classification part, the Softmax layer computes the probability distribution over activity types from the features passed by the Concat layer, and the activity identified by our model is the one with the highest probability in that distribution.

Dilated convolution channel 1 is composed of three dilated convolution layers. The model first extracts features with sequentially increasing receptive fields through dilated convolution layers with dilation factors of 2, 3, and 4, respectively. A BN layer is attached to each convolution layer before activation to increase the network's learning rate and reduce the risk of overfitting. Finally, the resulting features are flattened and passed to the fully connected layer. The structures of dilated convolution channels 2 and 3 are similar to that of channel 1, differing only in their convolution kernel sizes.
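The following Keras sketch assembles the layout just described, under stated assumptions: the dilation factors 2, 3, and 4 and the BN-before-activation ordering follow the text, while the kernel sizes (3, 5, 7), the filter count, and the input shape (128 steps, 9 channels, as in the UCI HAR data) are illustrative:

```python
from tensorflow.keras import layers, models

def dilated_channel(inputs, kernel_size, filters=32):
    """One dilated convolution channel: three dilated conv layers with
    dilation factors 2, 3, 4, each followed by BN and then ReLU."""
    x = inputs
    for rate in (2, 3, 4):                  # receptive field grows layer by layer
        x = layers.Conv1D(filters, kernel_size,
                          dilation_rate=rate, padding="same")(x)
        x = layers.BatchNormalization()(x)  # BN before activation
        x = layers.Activation("relu")(x)
    return layers.Flatten()(x)              # one-dimensional feature vector

inputs = layers.Input(shape=(128, 9))       # UCI HAR window: 128 steps, 9 channels
features = layers.Concatenate()(            # Concat layer: feature splicing
    [dilated_channel(inputs, k) for k in (3, 5, 7)])
outputs = layers.Dense(6, activation="softmax")(features)  # six activity classes
mdcnn = models.Model(inputs, outputs)
mdcnn.summary()
```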

4.2. Model Training of MDCNN

The multichannel dilated convolution model discards the pooling layer of the traditional CNN, avoiding the reduction in feature-map resolution caused by pooling. The proposed model introduces dilated convolution kernels to increase the receptive field and capture long-sequence information in the sensor time series, while the multichannel structure extracts multiscale features.

The training and optimization of the CNN depend on the loss function. The loss function calculates the error between the predicted value and the true value; the backpropagation algorithm propagates this error from the last layer back through each layer of the network and updates the weights. The updated parameters continue to participate in training, looping until the loss function reaches its minimum, i.e., the final training goal. In this paper, model training uses the cross-entropy loss function, computed as

$$\mathrm{LOSS} = -\frac{1}{M} \sum_{i=1}^{M} \sum_{c=1}^{C} y_{ic} \log \hat{y}_{ic},$$

where $\hat{y}_{ic}$ is the predicted probability that the $i$th training sample belongs to class $c$, $y_{ic}$ is the corresponding entry of the one-hot vector representing the label of the $i$th sample, $M$ is the total number of samples, and $C$ is the total number of label classes.

Large weights can cause the weight vector to get stuck in a local minimum, since gradient descent makes only small changes to the optimization direction; this eventually makes it hard to explore the weight space [38]. L2 regularization adds an extra term to the cost function that penalizes large weights. For each set of weights, the penalty term is added to the loss function:

$$\mathrm{LOSS} = \mathrm{LOSS}_0 + \frac{\lambda}{2} \sum_{w} w^2,$$

where $\mathrm{LOSS}_0$ is the loss function without regularization, $\lambda$ is the regularization coefficient, and the sum runs over all weights $w$ of the model. In summary, the standardized training set is input to MDCNN, and the model parameters are trained to obtain the recognition model.
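A worked NumPy sketch of this regularized objective, with invented labels, predictions, and weights, is:

```python
import numpy as np

# Categorical cross-entropy plus an L2 penalty, per the formulas above.
def regularized_loss(y_true, y_pred, weights, lam):
    ce = -np.mean(np.sum(y_true * np.log(y_pred + 1e-12), axis=1))  # cross-entropy
    l2 = 0.5 * lam * sum(np.sum(w ** 2) for w in weights)           # L2 penalty
    return ce + l2

y_true = np.array([[0, 0, 1], [1, 0, 0]])               # one-hot labels
y_pred = np.array([[0.1, 0.2, 0.7], [0.8, 0.1, 0.1]])   # predicted probabilities
weights = [np.array([[0.3, -0.2], [0.1, 0.4]])]         # one weight matrix
print(regularized_loss(y_true, y_pred, weights, lam=1e-3))
```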

5. Experiment

5.1. Experiment Dataset

We used the Smartphones (HAR) dataset [45] from the UCI Machine Learning Repository in our experiments. The dataset contains a total of 10,299 sensor samples collected in a laboratory from 30 subjects between the ages of 19 and 48. It covers six activity modes: walking, going upstairs, going downstairs, sitting, standing, and lying down. Each subject carried a smartphone that recorded accelerometer and gyroscope data at a frequency of 50 Hz. The accelerometer data is separated into total acceleration and body acceleration; all data are preprocessed with a noise filter and then split into data windows with 50% overlap between adjacent windows. The dataset also offers 561 time- and frequency-domain features, but we do not use these features in our experiments. Figure 5 shows the structure of a sensor data sample used in the experiment. The dataset is divided into a training set and a test set in a 7 : 3 ratio for the experiment, and Table 1 describes the composition of the human activity dataset.
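A minimal loading sketch for this dataset follows, assuming the standard UCI HAR directory layout with raw windows under "Inertial Signals"; the local root path is a placeholder:

```python
import numpy as np

SIGNALS = ["body_acc_x", "body_acc_y", "body_acc_z",
           "body_gyro_x", "body_gyro_y", "body_gyro_z",
           "total_acc_x", "total_acc_y", "total_acc_z"]

def load_split(split, root="UCI HAR Dataset"):
    # Each signal file holds one 128-reading window per row.
    xs = [np.loadtxt(f"{root}/{split}/Inertial Signals/{s}_{split}.txt")
          for s in SIGNALS]
    X = np.stack(xs, axis=-1)                                 # (samples, 128, 9)
    y = np.loadtxt(f"{root}/{split}/y_{split}.txt").astype(int) - 1  # labels 0..5
    return X, y

X_train, y_train = load_split("train")
X_test, y_test = load_split("test")
```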

5.2. Experiment Result

The experiments were run on a laptop with an Intel i5-8250U CPU and 8 GB of RAM. The programming language is Python 3.7, and the framework is Keras with a TensorFlow backend. To make the experimental process more efficient, the sample data was fed to the model in batches of 32. The model used the Adam update rule to optimize the training parameters and minimize the loss, with the maximum number of training iterations set to 150 and the learning rate set to 0.0015. We trained the model, tested it on the test set, and obtained the classification confusion matrix in Table 2.
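For reproducibility, a hedged Keras sketch of this training configuration is shown below; it reuses the `mdcnn` model and the data arrays from the earlier sketches:

```python
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical

# Training configuration as reported in the text: Adam optimizer with a
# learning rate of 0.0015, batch size 32, and up to 150 epochs.
mdcnn.compile(optimizer=Adam(learning_rate=0.0015),
              loss="categorical_crossentropy",   # cross-entropy objective
              metrics=["accuracy"])

history = mdcnn.fit(X_train, to_categorical(y_train, 6),
                    validation_data=(X_test, to_categorical(y_test, 6)),
                    batch_size=32, epochs=150)
```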

As can be seen from Table 2, the proposed model achieves excellent recognition results: the accuracy is 95.49%, and the precision for walking and lying down exceeds 98.5%. The proposed model has a slightly lower F1 score in distinguishing between sitting and standing, mainly because both are static states. The signal waveform collected by the sensor at rest is so weak that the model cannot extract enough information from the sensor data to distinguish the two types adequately; it may also be that CNN has some weaknesses in identifying static activities. A next step is to further improve the model's recognition accuracy for static-state activities.

We compare the accuracy of MDCNN with other algorithms in the literature according to the experimental results, as shown in Table 3. First, compared with traditional methods (SVM; GCHAR), our model shows a significant improvement; traditional methods rely heavily on hand-crafted features, which are shallow and inevitably lose some implicit key features. Second, we compare neural network models (LSTM, CNN, and DRNN). Among the three, CNN performs better than RNN or LSTM. CNN has advantages in feature extraction: the convolution kernels extract abstract high-level gait features layer by layer, and these features play a decisive role in the final classification. Compared with RNN, CNN is better able to learn the crucial features contained in recursive patterns in complex cyclic processes such as gait [35].

Table 3 shows that the proposed model attains the highest recognition accuracy except for the CNN in [38] and the multiple CNN in [39], both of which incorporate frequency-domain features. Frequency-domain features appear to provide global information that is difficult to obtain in CNN's automatic feature extraction: CNN attends more to local features than to global ones, and with the kernel lengths of traditional CNN, extracting global information is difficult. After adding the dilated convolution structure to the convolution layer, the effective length of the convolution kernel increases, and the widened receptive field can extract longer context information. The experimental results show that our model improves over ordinary CNN models that do not rely on manual features. How to extract more global features with our model will be future work.

An identification model must also consider computational cost. The CNN in [38] and the multiple CNN in [39] have complex network frameworks, which incur expensive computation and can hardly meet real-time requirements in practical applications. Besides, both use FFT features, and the multiple CNN additionally uses spectrograms; this extra feature extraction consumes much time, which also hinders real-time calculation. In contrast, the proposed model achieves nearly similar performance using only raw sensor data without any manual features. MDCNN implements real-time HAR, which is difficult for these two complex models. Its training and testing times are also superior to those of other real-time deep learning models: the training time per epoch is about 6.01 s on a laptop CPU, whereas DRNN took 116.39 s per epoch in a GPU environment. Our model completes the whole training process in only about 15 minutes; although not negligible, the training process needs to be run only once in a practical application, and a device loaded with the pretrained model can identify measured data in real time. In our experiment, MDCNN completed the identification of all test samples within 1 s 323 μs; that is, the time to identify each sample is about 0.34 ms. Because the sensor's sampling frequency is 50 Hz, our model is sufficient for real-time HAR. This efficiency arises because CNN parallelizes well during training, and the dilated convolution achieves a more efficient convolution operation at the same computational complexity.

In general, the proposed model achieves real-time HAR with high recognition accuracy and low computational complexity. The model automatically and efficiently mines the deep, highly recognizable essential features embedded in the data. More importantly, MDCNN expands the receptive field by introducing dilated convolution without increasing the number of parameters, so the model can mine temporal-dependency information in long sequences to some extent, which makes up for the defects of traditional CNN on time-series problems.

5.3. Network Structure Analysis

This section analyzes the impact of network structure on the accuracy of the proposed model. First, we design an experiment to verify whether the pooling layer is necessary for the proposed model. This experiment compares the accuracy of the proposed model with that of variants containing a pooling layer, with pool sizes of 2 and 3, respectively. In both sets of experiments, the pooling layer is placed after the last convolution layer. The results are shown in Table 4.

Table 4 shows that the proposed model achieves higher accuracy than the two models with pooling layers, and the model with the larger pool size is less accurate than the one with the smaller pool size. This is because the pooling layer reduces computation by reducing resolution, which loses some information useful for classification; as the pool size increases, more information is lost, and the accuracy decreases accordingly.

Second, we designed a comparison experiment with different numbers of MDCNN layers to verify the validity of the dilated convolution and analyze the influence of network depth on activity recognition accuracy. The results are shown in Figure 6.

As can be seen from Figure 6, the recognition accuracy of MDCNN improves steadily as the number of layers increases from 1 to 3. This is because the strength of CNN is mining the nonlinear structure contained in the raw data; if the network is too shallow, it cannot make full use of CNN's powerful fitting ability. However, the accuracy of the four-layer MDCNN is lower than that of the three-layer network. This indicates that the deep features extracted by the fourth layer contribute little to recognition and may even be redundant, which harms the human activity recognition model.

6. Conclusion

This paper proposes an improved multichannel dilated convolutional neural network (MDCNN), which not only avoids manual feature extraction and reduces dependence on expert knowledge but also achieves excellent recognition results in the experiments. MDCNN is also a deep learning model that can achieve real-time HAR efficiently. By introducing dilated convolution and a multichannel convolution structure, MDCNN mines the raw sensor data more comprehensively, extracts more recognizable features, and increases the diversity of the feature set. The experiments also explored the influence of the MDCNN structure on recognition accuracy and constructed an effective human activity recognition model. It is worth noting that MDCNN, like other deep learning models, recognizes static activities with lower accuracy than dynamic activities, which requires further improvement. The next step will be to apply MDCNN to more complex types of activity recognition.

Data Availability

The data used in this study are from published literature articles and therefore are publicly available.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the Humanities and Social Sciences Fund Project from the Ministry of Education, China (no. 17YJAZH091) and the Excellent Master Degree Thesis Cultivation Project of Fujian Normal University (LWPYS053).