Abstract

Human activity recognition emerged relatively early as a research field and has attracted a great number of researchers. With the continuous development of science and technology, research on human activity recognition keeps deepening and broadening. Today, fields as diverse as medicine, education, sports, and the smart home have developed a strong interest in activity recognition, and a series of research results have already been put into real production and daily life. Smartphones have become ubiquitous, their technology is increasingly mature, and a wide range of sensors is built into them, so research on activity recognition based on mobile phone sensors is both necessary and feasible. In this article, the acceleration sensor of an Android smartphone is used to collect data for six basic human activities: walking, running, standing, sitting, going upstairs, and going downstairs. A convolutional neural network (CNN), a classic deep learning model, is used to fuse the multidimensional sensor data, and TensorFlow is used for model training and test evaluation. The trained model is finally transplanted to an Android phone to complete a mobile-end activity recognition system.

1. Introduction

Human activity recognition is a branch of pattern recognition, and related research can be traced back to the 1980s. Because it can provide personalized support for many different applications and is connected with many subject areas, such as medicine, human-computer interaction, and sociology [1, 2], research on human activity recognition has never stopped and has always been a research hotspot. Numerous researchers have tried to find methods that can identify human activities efficiently and accurately [3, 4]. Supported by various software and hardware, many good results have been achieved in this field, but the recognition performance still needs to be improved. With the continuous development of technology and theory, new exploration and research in human activity recognition remain necessary in order to propose efficient and accurate recognition methods in the future [5, 6].

At present, research in the field of human activity recognition can be divided into two categories: one is based on the analysis of video images, and the other is based on various motion sensors, such as inertial navigation modules and acceleration sensors [7, 8]. Image-based methods can be more intuitive and accurate and can better identify complex motion states. At present, research on human activity recognition in China is mainly based on image analysis [9, 10]. This approach has its advantages, but its disadvantages are also obvious: it places higher demands on data acquisition equipment, its cost is higher, and it can only be used in specific venues [11, 12]. Therefore, this approach is not widely deployed and is applied only in some specific fields. It is not the research content of this article and will not be described in detail; interested readers can consult the related literature.

The other type of activity recognition is based on motion sensors. In recent years, with the rapid development of technology, the accuracy of sensors has increased while production costs have decreased [13, 14]. In particular, the number of smartphone users has grown at an almost explosive rate, the sensors embedded in smartphones are becoming more sophisticated, and wearable devices are spreading rapidly [15, 16]. These trends have led many researchers to recognize the prospects of sensor-based activity recognition. Therefore, research in this field has attracted more and more researchers and has produced many gratifying results [17, 18]. Many of these results have been applied to people's daily lives, for example, in medicine and sports. In China, researchers have also proposed many cutting-edge results. However, as research continues and accuracy increases, expanding the variety of recognizable activities has become a major problem for which there is still no particularly good solution, so research in this direction still requires the continued efforts of many researchers.

Research on human activity recognition can thus be divided into two general directions [19]: research based on video images and research based on motion sensors. The general research methods and steps are similar and can be divided into the following stages: data collection, feature extraction, model building, and model evaluation. The research in this paper starts with data acquisition, that is, acceleration sensor data, and uses TensorFlow (Google's open-source system for artificial intelligence) to build a convolutional neural network (CNN). We then use the collected data for model training, testing, and evaluation. In the end, we transplant the trained model to the mobile computing platform to implement a mobile-end activity recognition system.

The rest of this paper is organized as follows. Section 2 discusses the related work. Section 3 elaborates the data collection work in real-world environment. Section 4 explains the proposed convolutional neural network-based model. Section 5 depicts the implementation and running of our platform. Section 6 presents the experiments and analysis. Section 7 concludes the whole paper.

2. Related Work

As early as 2010, a method using signal strength descriptors to detect indoor movement was proposed. This method can also be applied to the detection of objects near the transmitter and of people moving in the room [20, 21].

Later, methods for using WiFi signals to track the human body and recognize simple gestures were developed one after another [22, 23], and corresponding tools were released, such as tools that record detailed measurements of the wireless channel and track received 802.11 packets [24]. Wireless networks can realize device-free fall detection [25], which breaks through the limitations of conventional fall detection systems without external modification or additional environment settings. WiFi signals can also be used to detect smoking behavior, and a passive smoking detection system based on foreground detection has been realized [26]. Recent related work also focuses on device-free multitarget tracking in mobile environments and proposes an antinoise, unobtrusive, device-free tracking framework [27].

In addition, fine-grained activity recognition can be realized with RGB-D cameras, which raises the intelligence of pervasive and interactive systems to a new level [28, 29]. Various new kinds of small sensors have performed brilliantly in human activity recognition applications. Through wearable sensors, or even the sensors of smartphones, human activities can be tracked to provide health care support [30, 31]. For example, wearable acoustic sensors can analyze the sound generated in the user's throat area and accurately identify the user's activities [32]. In Implantable Medical Devices (IMDs), wireless communication is used to counter interference attacks and improve the security of the IMD [33]. Through WiFi signals, we can not only recognize human activity but also hear speech by detecting the mouth and analyzing the fine-grained radio reflections of mouth movements [34]. When analyzing and computing over the collected data, edge computing can be used to improve resource utilization and execution efficiency [35].

Neural networks are being applied more and more widely in activity recognition. Image classification methods based on deep convolutional neural networks have promoted the development of neural information processing systems [36]. On the basis of CNNs, the ability of activity recognition has been improved using data augmentation and transfer learning [37], and CNN-based activity recognition frameworks have been developed [38]. When training on human activity data, a recognition model trained on one person may not work well when applied to predict another person's activities [39, 40]. To meet this challenge, data augmentation for human activity recognition is also a research hotspot.

3. Data Collection

3.1. Sensors of Smartphone

The data collected by the sensors of a smartphone are expressed in the phone's own coordinate system (the natural coordinate system of the device). As shown in Figure 1, the positive x-axis points to the right of the phone, the positive y-axis points upward, and the positive z-axis is perpendicular to the screen and points outward. The system we built monitors the change of the acceleration value of the smartphone along these three axes.

Looking at the official Google Android developer website, we can find that the Android platform provides thirteen sensor types. Some of these sensors are hardware-based and some are software-based. However, not all Android devices are equipped with all of these sensors; different devices integrate different subsets of them. The acceleration sensor used in this article is hardware-based. It monitors the phone's acceleration along the x-, y-, and z-axes in m/s² (including gravity; the axes are shown in Figure 1). This sensor is integrated in most mobile phones and tablets, and the Android platform has supported it since version 1.5 (Table 1), so an activity recognition system designed around this sensor can run on almost all Android phones currently on the market. The acceleration sensor measures the forces exerted on the sensor and derives the acceleration of the device according to the following relationship:

A_d = -(1/mass) * ΣF_s,

where A_d is the acceleration applied to the device and F_s are the forces applied to the sensor. Gravity always influences the measured value, so the measurement actually follows the relationship:

A_d = -g - (1/mass) * ΣF.
The Android platform provides a complete sensor framework, including a series of sensor classes and interfaces. Using the corresponding API, we can easily access the functions of each sensor. The main classes used are SensorManager, Sensor, SensorEvent, and SensorEventListener. Their use with the acceleration sensor studied in this paper is as follows:
(1) First, we obtain an instance of the SensorManager class, which manages each sensor, for example, creating a sensor instance, setting the sensor sampling frequency, and registering/unregistering sensor event listeners.
(2) Second, we create an instance of the Sensor class (the acceleration sensor) through the SensorManager instance.
(3) We use the SensorManager to set the sampling frequency of the acceleration sensor and to register event listening.
(4) We override the listener method onSensorChanged(SensorEvent event), in which the accelerations on the x-, y-, and z-axes can be obtained through event.values[0], event.values[1], and event.values[2], respectively, and the timestamp through event.timestamp.

3.2. Design and Implementation of Data Collector

For the sake of experimental convenience, the data are stored locally as a txt text file, with one sample per line in the following format:

Data format: user ID, activity label, timestamp, x-axis acceleration, y-axis acceleration, z-axis acceleration. Example: 1, sitting, 288956018483233, −6.3601966, −1.3551182, 7.7943234. After registering, logging in, and entering the main page of the collector, the user selects the switch button for the corresponding activity on the interface to start collecting sensor data. Every 200 readings are written to the log file. The data collected in the first batch are discarded, and recording starts from the second batch. The interface of the collector is shown in Figure 2.
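As an illustration, a minimal Python sketch for reading one line of this log into numeric fields might look as follows; the helper name and the dictionary layout are assumptions based on the format above, not the actual collector code:

```python
# Hypothetical helper for parsing one line of the collector log.
# Expected field order: user ID, activity label, timestamp, x-, y-, z-acceleration.
def parse_sample(line):
    user_id, activity, timestamp, ax, ay, az = [s.strip() for s in line.split(",")]
    return {
        "user": int(user_id),
        "activity": activity,
        "timestamp": int(timestamp),
        "x": float(ax),
        "y": float(ay),
        "z": float(az),
    }

sample = parse_sample("1, sitting, 288956018483233, -6.3601966, -1.3551182, 7.7943234")
print(sample["activity"], sample["x"], sample["y"], sample["z"])
```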

Due to limitations of experiment time, environment, personnel, and other factors, the only person involved in the sampling was the author. Sampling in this way also tests whether a good model can be trained from a relatively small amount of data. During collection, the phone was placed in the right front trouser pocket with the screen facing forward and the top of the phone pointing down. The total number of samples finally obtained is 153,000; the number of samples per activity is shown in Table 2.

The specific pie chart distribution is shown in Figure 3.

The sensor sampling frequency is 50 Hz; that is, one reading is collected every 0.02 s, and the data are written in batches of 200 readings. Here, a waveform analysis is performed on the first 200 readings collected for each activity, that is, the data collected by the sensor over 4 seconds. The horizontal axis of each waveform graph is the timestamp, and the vertical axis shows the x, y, and z accelerations as well as the resultant acceleration, computed as

Acc = sqrt(x² + y² + z²),

where Acc denotes the total acceleration.
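Assuming the per-axis readings are held in a NumPy array, the total acceleration of each sample can be computed as in the following short sketch (the example values are illustrative):

```python
import numpy as np

# acc has shape (n_samples, 3): columns are x-, y-, and z-axis acceleration.
acc = np.array([
    [-6.3601966, -1.3551182, 7.7943234],
    [ 0.1031580,  9.7125440, 0.8012345],
])

# Total (resultant) acceleration per sample: Acc = sqrt(x^2 + y^2 + z^2).
total_acc = np.sqrt((acc ** 2).sum(axis=1))
print(total_acc)
```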

Each activity corresponds to four waveforms. By observing the waveform diagrams of the sensor data corresponding to human activities, we can intuitively see that the sensor data generated by different activities are different and exhibit a certain regularity. Therefore, we can use deep learning to let the machine learn this regularity and generate a corresponding model to identify the related human activities. The 4 s sampling waveform for going downstairs is given in Figure 4, for running in Figure 5, for sitting in Figure 6, for standing in Figure 7, for going upstairs in Figure 8, and for walking in Figure 9.

4. Convolutional Neural Network-Based Model

A convolutional neural network (CNN) is a type of feedforward neural network. CNNs have been widely used in image and speech recognition because of their good results. The most common application field of CNNs is pattern recognition, especially large-scale image processing, where they show extraordinary performance: because images can be fed into the network directly, complicated feature extraction and data reconstruction processes can be avoided. Owing to continuous innovation, CNNs are now also used in video analysis, natural language processing, and drug discovery, and they have become a research hotspot in many scientific fields, with various disciplines trying to use CNN technology to add new vitality to their own areas.

Since the successful release of the AlexNet architecture in 2012, a series of classic architectures such as VGG, GoogLeNet, and ResNet have appeared. In recent years, researchers have continued to design new methods to improve CNNs, so many variants of the CNN architecture have been proposed, and the detailed descriptions of CNNs in different publications may differ in some places. However, regardless of the variant, the basic concepts and principles of the CNN architecture do not change, and the components are very similar. We adopt LeNet-5 in this paper and adjust its parameters to meet our needs. In addition to the input layer and output layer, LeNet-5 can be divided into six layers, each containing a different number of trainable parameters (connection weights), as shown in Figure 10. The specific structure is convolutional layer, pooling layer, convolutional layer, pooling layer, fully connected layer, and fully connected layer; a code sketch of this layer stack is given after the list below.
(1) Convolutional layer: the convolutional layer is used for feature extraction. In convolutional neural networks, multiple convolutional layers are often stacked to obtain deeper feature maps.
(2) Pooling layer (subsampling layer): the main work of the pooling layer is to compress the input feature map along the spatial dimensions (height and width). Pooling compresses the feature map output by the convolutional layer to extract the main features, thereby reducing the number of parameters and accelerating the network. Pooling also provides a degree of translation invariance, which keeps the extracted features stable when the image is translated or scaled and helps the network produce the same, correct recognition result for such images.
(3) Fully connected layer: the work of the fully connected layer is relatively simple. It connects all the features and passes the output values to a classifier (e.g., SVM or Softmax) for the final classification decision.
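To make the layer order concrete, the following is a minimal TensorFlow 1.x sketch of such a conv-pool-conv-pool-fully-connected-fully-connected stack. The window shape (200 time steps by 3 axes), filter counts, and kernel sizes are illustrative assumptions, not the exact parameters of our trained model:

```python
import tensorflow as tf

# Illustrative sketch only: the paper specifies the layer order
# conv -> pool -> conv -> pool -> fully connected -> fully connected;
# window length, filter counts, and kernel sizes below are assumptions.
NUM_CLASSES = 6

x = tf.placeholder(tf.float32, [None, 200, 3, 1])      # [batch, time, axes, channels]
y_ = tf.placeholder(tf.float32, [None, NUM_CLASSES])   # one-hot labels

conv1 = tf.layers.conv2d(x, filters=32, kernel_size=[5, 3], padding="same",
                         activation=tf.nn.relu)
pool1 = tf.layers.max_pooling2d(conv1, pool_size=[4, 1], strides=[4, 1])

conv2 = tf.layers.conv2d(pool1, filters=64, kernel_size=[5, 1], padding="same",
                         activation=tf.nn.relu)
pool2 = tf.layers.max_pooling2d(conv2, pool_size=[4, 1], strides=[4, 1])

flat = tf.layers.flatten(pool2)
fc1 = tf.layers.dense(flat, 128, activation=tf.nn.relu)
logits = tf.layers.dense(fc1, NUM_CLASSES)
y = tf.nn.softmax(logits)                               # predicted class probabilities
```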

The computation procedure of the LeNet-5 employed in this paper is as follows. The activation that neural unit i receives from the previous layer is denoted a_i. We use the sigmoid function to generate the state z_i, which is computed as follows:

z_i = sigmoid(a_i) = 1 / (1 + e^(-a_i)),

where sigmoid() refers to the sigmoid function. The output layer is computed using RBF (radial basis function) units to produce the result for each class:

y_j = Σ_i (z_i − w_ij)²,

where w_ij represents the ground-truth (target) code of class j.
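For reference, a small NumPy sketch of these two computations is shown below; all values, including the target code, are purely illustrative:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# States of the last hidden layer, computed from incoming activations a_i.
a = np.array([0.3, -1.2, 2.0])
z = sigmoid(a)

# RBF output unit for class j: squared Euclidean distance between the state
# vector z and the ground-truth (target) code w_j of that class.
w_j = np.array([1.0, 0.0, 1.0])
y_j = np.sum((z - w_j) ** 2)
print(z, y_j)
```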

The overall structure in our paper is similar to LeNet-5: apart from the input and output, it can be divided into six layers, namely, convolutional layer, pooling layer, convolutional layer, pooling layer, fully connected layer, and fully connected layer. We use the visualization module that comes with TensorFlow to visualize the concrete network structure. The LeNet-5 structure is shown in Figure 10, and the CNN structure used in our work is presented in Figure 11.

5. The Implementation and Running of Our Platform

TensorFlow is an open-source machine learning system from Google and an upgraded version of DistBelief. According to the official statement, TensorFlow improves performance by almost a factor of two over DistBelief in some benchmark tests. Strictly speaking, TensorFlow is not a neural network library, although it is often used to implement neural networks; in essence, it is an open-source software library that expresses computations as data flow graphs and executes them numerically. Therefore, as long as a calculation can be expressed in the form of a data flow graph, it can be implemented with TensorFlow, which makes TensorFlow a very powerful and highly flexible tool.
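As a tiny illustration of the data-flow-graph idea, the TensorFlow 1.x snippet below builds a graph for a simple numeric computation and then executes it in a session:

```python
import tensorflow as tf

# Build the graph: nodes are operations, edges carry tensors.
a = tf.constant(3.0)
b = tf.constant(4.0)
c = tf.sqrt(a * a + b * b)   # any computation expressible as a graph

# Run the graph.
with tf.Session() as sess:
    print(sess.run(c))       # 5.0
```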

In this paper, we use TensorFlow to build the convolutional neural network (CNN). In addition to the points mentioned above, another crucial advantage is TensorFlow's portability. This feature provides strong support for the ultimate purpose of this paper, namely, implementing an activity recognition system on a smartphone: we can port our trained model almost seamlessly into the mobile phone project.
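One common way to do this with TensorFlow 1.x is to freeze the trained graph into a single .pb file that the Android project can then load. The sketch below is only illustrative; the output node name "y_pred", the file paths, and the helper name are assumptions rather than the exact export code used here:

```python
import tensorflow as tf
from tensorflow.python.framework import graph_util

def export_frozen_graph(sess, output_node="y_pred", out_dir="./export"):
    """Freeze variables into constants and write a single .pb file.

    `sess` is a session holding the trained model; `output_node` is the
    name of the prediction tensor (an assumption in this sketch).
    """
    frozen = graph_util.convert_variables_to_constants(
        sess, sess.graph_def, output_node_names=[output_node])
    tf.train.write_graph(frozen, out_dir, "har_model.pb", as_text=False)
```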

This paper uses Pandas and NumPy for data processing, functions from the scikit-learn package for machine learning analysis, and Matplotlib for plotting. Readers can install these packages separately or, for convenience, directly install Anaconda as we did, which bundles many third-party libraries related to scientific computing, such as NumPy and Pandas. The whole data pipeline consists of four steps: data preprocessing, data normalization, data sampling, and saving of the data.
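The sketch below outlines these four steps with Pandas and NumPy. The column names, the window length of 200 readings, and the file names are assumptions chosen to be consistent with the data format of Section 3, not the exact code of our pipeline:

```python
import numpy as np
import pandas as pd

COLS = ["user", "activity", "timestamp", "x", "y", "z"]

# 1. Preprocessing: load the raw log and drop incomplete rows.
df = pd.read_csv("sensor_log.txt", header=None, names=COLS).dropna()

# 2. Normalization: zero mean, unit variance per axis.
for axis in ["x", "y", "z"]:
    df[axis] = (df[axis] - df[axis].mean()) / df[axis].std()

# 3. Sampling: cut the stream into fixed-length windows of 200 readings.
WINDOW = 200
segments, labels = [], []
for start in range(0, len(df) - WINDOW, WINDOW):
    chunk = df.iloc[start:start + WINDOW]
    segments.append(chunk[["x", "y", "z"]].values)
    labels.append(chunk["activity"].mode()[0])   # majority label in the window

# 4. Saving: one-hot encode labels and persist the arrays for training.
X = np.asarray(segments, dtype=np.float32)
y = pd.get_dummies(labels).values
np.savez("har_dataset.npz", X=X, y=y)
```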

6. Experiment and Evaluation

According to the convolutional neural network model designed in Section 4, we use the functions provided by TensorFlow to build the network and configure the corresponding parameters.

6.1. Training Model

In order to train the model, we need to define an index to evaluate its quality. In general, we define a loss that indicates how bad the model is and then try to minimize it. This paper uses cross entropy as the loss function. We do not derive cross entropy in detail here but give its definition as follows:

H_{y'}(y) = −Σ_i y'_i log(y_i).

Here, y_i is the predicted probability distribution, and y'_i is the actual (ground-truth) distribution. In the actual implementation, we compute it in code.
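A minimal TensorFlow 1.x sketch of this loss, assuming y holds the model's predicted softmax distribution and y_ holds the one-hot ground-truth labels (these names are explained below), is:

```python
# Cross-entropy between the one-hot ground truth y_ and the prediction y.
cross_entropy = -tf.reduce_sum(y_ * tf.log(y))
```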

Here, y_ corresponds to y'_i in the formula above, y corresponds to y_i, and cross_entropy is the cross-entropy loss we define. We then choose the gradient descent algorithm to back-propagate continuously and adjust the variable values so as to reduce the loss. In this paper, the batch size is set to 50, the learning rate is 0.0001, and the number of training iterations is 4. Finally, a test accuracy of 98.24% is obtained. The reason why only 4 iterations are needed to reach convergence is, we believe, that our earlier expansion of the data introduced a lot of similar data into the training set, which accelerated the convergence of the network: if two batches contain essentially the same data, one pass over them has roughly the training effect of two passes over distinct data.
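A condensed TensorFlow 1.x training-loop sketch matching these settings (batch size 50, learning rate 0.0001, 4 iterations over the data) is shown below. It continues from the placeholders x and y_ and the loss cross_entropy defined above; train_X and train_y stand for the prepared training arrays from Section 5, and the next_batch helper is a simple illustrative assumption:

```python
import numpy as np
import tensorflow as tf

def next_batch(features, labels, batch_size):
    """Yield shuffled (features, labels) minibatches; a simple illustrative helper."""
    idx = np.random.permutation(len(features))
    for start in range(0, len(features), batch_size):
        batch = idx[start:start + batch_size]
        yield features[batch], labels[batch]

# Gradient descent on the cross-entropy loss defined above; learning rate 0.0001.
train_step = tf.train.GradientDescentOptimizer(0.0001).minimize(cross_entropy)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(4):                                  # 4 passes over the data
        for batch_x, batch_y in next_batch(train_X, train_y, 50):
            sess.run(train_step, feed_dict={x: batch_x, y_: batch_y})
```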

6.2. Evaluation Model and Parameter Tuning

This paper uses the TensorBoard tool to track the changes in the loss and accuracy values during training. From the curves of these two values for our final model, it can be seen that the loss quickly approached 0 during training and the accuracy quickly approached 1. Therefore, it can be judged that the generated model performs reasonably well.

The loss used in this article is the cross-entropy loss mentioned above, which is not repeated here; instead, we describe how the accuracy is obtained. The tensorflow.argmax function gives the index of the maximum value of a tensor along a given dimension. The labels were previously converted to one-hot encoding with the pandas.get_dummies function, so the label vectors consist only of 0s and 1s. Therefore, whether a prediction is correct can be determined by checking whether the index of the predicted label and that of the true label are the same. The proportion of correct predictions is then the prediction accuracy, which is an important value for model evaluation.
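A sketch of this computation in TensorFlow 1.x, again assuming y is the predicted distribution and y_ the one-hot labels defined earlier:

```python
# Index of the largest value along the class dimension = predicted / true class.
correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
# Fraction of correct predictions = accuracy.
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
```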

In addition, this paper uses three more indicators to evaluate the quality of the model: precision, recall, and F1-score. The higher the recall, the stronger the model's ability to recognize positive samples; the higher the precision, the stronger the model's ability to reject negative samples; together, high scores indicate a more robust model. The three indicators are calculated as follows:

P = TP / (TP + FP),
R = TP / (TP + FN),
F1 = 2 × P × R / (P + R),

where P denotes precision and R denotes recall, TP is the number of true positive samples, TN is the number of true negative samples, FP is the number of false positive samples, and FN is the number of false negative samples.
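Since scikit-learn is already part of our toolchain (Section 5), these scores can be computed directly from the predicted and true class indices. The sketch below assumes arrays y_true and y_pred of class indices; the weighted averaging is one common choice for multiclass problems, not necessarily the exact setting used here:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# y_true and y_pred are 1-D arrays of class indices for the test set.
precision = precision_score(y_true, y_pred, average="weighted")
recall = recall_score(y_true, y_pred, average="weighted")
f1 = f1_score(y_true, y_pred, average="weighted")
print(precision, recall, f1)
```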

In order to improve these indicator scores, this paper uses the control variable method to compare multiple groups of experiments and tune the model parameters. After continuous optimization, the final model achieves, on the test set, a precision of 0.9825, a recall of 0.9824, and an F1-score of 0.9823. Finally, we applied the model to the test set and output the corresponding confusion matrix. In a confusion matrix, the columns represent the values predicted by the model for the given input, and the rows correspond to the ground-truth categories. Through the confusion matrix, we can intuitively inspect the predictions made by the model and better judge its status. The confusion matrix of the final model on the test set is given in Table 3.
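A confusion matrix such as Table 3 can also be produced with scikit-learn. The sketch below assumes integer class indices and a hypothetical label ordering:

```python
from sklearn.metrics import confusion_matrix

ACTIVITIES = ["downstairs", "running", "sitting", "standing", "upstairs", "walking"]
# Rows: ground-truth classes; columns: predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=list(range(len(ACTIVITIES))))
print(cm)
```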

From the analysis of the confusion matrix, we can conclude that 10,608 predictions were made, of which 187 were wrong. There are 72 errors for going upstairs, an error rate of 4.55%; going upstairs is most easily confused with going downstairs and walking. There are 52 errors for walking, an error rate of 3%; walking is most easily confused with going upstairs. There are 51 errors for running, an error rate of 4.35%; running is most easily confused with going downstairs and going upstairs. There are 12 errors for going downstairs, an error rate of 0.67%; this activity is most easily confused with going upstairs and walking. The accuracy for sitting and standing is 100%. Therefore, it can be concluded that the model recognizes standing, sitting, and going downstairs more accurately, while the recognition of going upstairs, running, and walking is slightly inferior.

7. Conclusion

This paper starts from data collection, uses convolutional neural network modeling, and finally transplants the model back to the mobile phone to complete the activity recognition system. After a series of experiments and tests, we found that collecting data with mobile phone sensors and then training a model with a convolutional neural network can accomplish the activity recognition task well. Despite the limited amount of data, this paper is able, through some operations on the data, model optimization, and parameter adjustment, to train a model with an accuracy of more than 98%.

We verified the practical feasibility of the model by porting it to a real device for testing. Therefore, this article concludes that it is entirely feasible to train a convolutional neural network model on mobile phone sensor data to build an activity recognition system, and that this method has great potential.

Data Availability

The underlying data supporting the results of this paper are generated during the study.

Conflicts of Interest

The authors declare that they have no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the Major Special Science and Technology Project of Hainan Province (Grant no. ZDKJ2017012), the National Key R&D Program of China (no. 2020YFB2104004), and the Qinghai Key R&D and Transformation Project (no. 2021-GX-112).