Abstract

The paper proposes a system for identifying gestures and actions in smart homes. The proposed method is based on MobilenetV2 feature extraction combined with a single shot detector (SSD) network. We use eleven types of gestures and actions, namely, walking, sitting down, falling back, wearing shoes, waving hands, falling down, smoking, baby crawling, standing up, reading, and typing. In this system, the data are captured from the camera of a mobile device and used to detect objects; detected objects are marked on each frame with a bounding box. The results show that the system meets the requirements with an accuracy of over 90%, which is suitable for real applications.

1. Introduction

Identifying gestures and actions from still images and video sequences is challenging due to issues such as background clutter and lighting conditions. Many interactive applications between humans and computers, humans and robots, or, more recently, for controlling electronic devices are widely studied. They allow computer systems to assist users in improving their lives and healthcare [1–3]. Two main methods for deploying such systems are surveillance and wearable devices [4]. Surveillance systems typically use fixed cameras for user interaction, whereas wearable devices such as smart watches use voice or touch to control automatic systems. In this paper, we focus on the method of using fixed monitoring equipment.

In [5], the authors combine a two-stream ConvNet, pretrained on the ImageNet database, with a recurrent neural network (RNN); both temporal and spatial information are taken as input to the RNN. Such models have a large number of parameters and high computational complexity, so they do not achieve good performance in terms of processing time or memory when used for feature extraction. In [6], the authors showed that Mobilenet has far fewer parameters than other feature-extraction networks, while the accuracy of the two models is almost the same. Therefore, we propose a method using Mobilenet networks, which significantly reduces the number of parameters and is easy to use on weakly configured equipment.

To recognize human gestures and actions, we need to extract features that computers can identify effectively. Gestures and actions such as walking, sitting down, waving, and tying shoelaces are very natural in human life and are given priority. However, machine learning, and especially deep learning, requires a large amount of computation and therefore a powerful computer. In this paper, we reduce the MobilenetV2 network parameters by removing the fully connected layer and use the network to extract image features. We then use the MobilenetV2 output as the input to the SSD network to identify the action.

In this paper, we make three main contributions. Firstly, we propose a gesture recognition system combining MobilenetV2 and SSD. Secondly, we build our own set of gestures that are suitable for smart-home applications. Thirdly, we build an application running the algorithm on mobile devices with real data, achieving an accuracy of over 90%.

Image recognition is comparable to human visual perception. It has come into everyday life and serves various demands. Facebook and other media platforms use the technology to enhance image search and assist visually impaired users. Businesses use image recognition to scan large databases to satisfy customer demands and improve the customer experience in their stores and in online shopping. In healthcare, medical image recognition and processing systems help professionals predict health risks and detect diseases early, which provides better services to patients. The goal of action identification is to create a system that can be used to control smart-home devices, and it could be applied to control digital devices in the future. This is an advanced technology in smart-home applications that allows controlling a screen without touching the device using AI technology.

The rest of the paper is presented as follows. In Section 2, we will present related work. In Sections 3 and 4, we present and evaluate the effectiveness of the proposed model, respectively. Finally, we give the conclusion in Section 5.

2. Related Work

Identifying actions is one of the applications for controlling digital devices in the future. This is an advanced technology that is being widely used in smart homes. Currently, many companies and research centers are actively testing high-tech models that allow screen control without touching the device using artificial intelligence (AI) technology. Action identification is therefore receiving increasing attention.

There are many studies on identifying actions [2–4]. In [2], the authors perform 3D skeleton identification based on the NTU-RGB + D and Kinetics datasets. The authors of [3] perform neural-network-based identification using joint trajectory maps (JTM). Khowaja and Lee [4] propose a solution that sequentially combines Inception-ResNetv2 and a long short-term memory (LSTM) network to take advantage of temporal variance and improve recognition performance. In that paper, the identification accuracy is 95.9% and 73.5% on the UCF101 and HMDB51 datasets, respectively.

Besides, there are machine learning approaches such as local orientation histograms and support vector machines [7–13]. Thanks to their learning ability, neural networks do not need to be manually designed; during a process that simulates human learning, they can be trained on gesture and action patterns to build a classification network. The deep learning model is inspired by the communication and information processing models of biological nervous systems and consists of neural networks with more than one hidden layer. They can learn the characteristics of their subjects easily and accurately.

For complex subjects, deep learning exhibits superior performance in computer vision and natural language processing (NLP) [8, 9]. Modern object detection systems are variants of the Backpropagation Neural Network (BPNN) and Faster RCNN [10, 14]. In [14], the authors compared AI networks and concluded that BPNN achieved the highest efficiency. In [11], the author presents the SSD, which optimizes object detection. Compared to Faster RCNN, the SSD is simpler and more efficient since it completely eliminates the proposal-generation stage and the subsequent resampling stages. It also encapsulates all computation in a single network, which makes the SSD easy to train and easy to integrate into systems. Besides, it works in conjunction with the MobilenetV2 network to operate on embedded and mobile devices quickly and efficiently.

However, there are several challenges in identifying actions:
(i) Developing training sample sets: identification using machine learning requires an appropriate set of sample data, so it takes time to collect data and create standard samples.
(ii) Processing time: we need to process large amounts of data. If a network has to handle too many parameters on a weakly configured machine, it will slow down, affecting real-time results.
(iii) Accuracy evaluation methods: for conventional cameras (webcams), accuracy is affected by conditions such as light, background, and hand movement speed, so we have to make some assumptions for the application.

As analyzed above, we propose an action identification system based on the combination of the MobilenetV2 network with the SSD network for easy use on embedded devices with weaker hardware configurations.

3. Proposed System

3.1. Overview of the Proposed System

We propose the system based on [6, 15]. In [15], the authors use the Resnet-101 model for object detection. Although its accuracy is high, the size of the network is large. Mobilenet, published later than Resnet-101, was proposed by authors from Google in 2017. In this network, the authors used a convolution technique called depthwise separable convolution to reduce model size and computational complexity. As a result, the model is useful when implemented on mobile and embedded devices; therefore, we propose to use Mobilenet and SSD in our system. Metrics of the convolution networks are shown in Table 1.

The proposed system is based on [12, 20] for application in smart-home models, as shown in Figures 1 and 2. In the proposed network, we first expand the number of channels, then apply a depthwise convolution over the expanded space, and finally project back to a smaller number of channels through a bottleneck filter combined with a residual connection. The residual connections are used in gradient calculations to improve performance. Besides, we also reduce the MobilenetV2 network parameters by removing the fully connected layer to extract the image feature, as shown in Figure 2.

The goal of this system is to build and process datasets from simple to complex actions. The proposed gestures include eleven actions, namely, walking, sitting down, falling back, putting on shoes, waving hands, falling down, smoking, baby crawling, standing up, reading, and typing. First, the system extracts the characteristics of the input data using the MobilenetV2 network and then feeds them to the SSD network to predict the results. The results obtained after the training process are converted to Tensorflow Lite (.tflite) format for execution on mobile devices.

After training, the Tensorflow model produces GraphDef and checkpoint graphs. These graphs are converted to Tensorflow Lite (.tflite) format and then loaded by its interpreter. The interpreter executes the model using a set of operators. Details of the steps are presented below.

3.2. Processing Steps

Tensorflow is used for creating models, training, manipulating data, and making predictions based on [12, 20, 21]. However, machine learning, especially deep learning, needs great computational power. Although training on mobile and embedded devices is possible, it would take a lot of time. To solve this problem, we use Tensorflow for the training phase and Tensorflow Lite for the inference phase, as shown in Figure 1.

The proposed method includes the following steps:
(i) Step 1: preparing data
(ii) Step 2: assigning labels to data
(iii) Step 3: using the MobilenetV2 network to extract features
(iv) Step 4: using the output of the MobilenetV2 network as input of the SSD network to detect the object
(v) Step 5: converting to Tensorflow Lite format
(vi) Step 6: creating an Android app to run the Tensorflow Lite model

Details of the steps are shown below.

3.2.1. Preparing Data

Firstly, we need to prepare the data, which includes a self-built data source, an online source via Google, and parts of UCF101 [22] and BU203 [23] with eight actions, namely, walking, sitting down, falling back, putting on shoes, waving hands, falling down, smoking, and baby crawling, as shown in Figure 3 [22, 23], plus three actions (standing up, reading, and typing) designed by ourselves.

The numbers of labels and images are shown in Table 2.

3.2.2. Labeling Data

In this step, we perform the ROI determination of each action based on manual labeling. In this paper, we use a built-in labeling tool. This process basically draws boxes around objects in the image. Figure 4 is an example using the LabelImg tool that automatically creates an XML file describing the location of the object in the image.

The values obtained are shown in Figure 5 based on [24]. After labeling the data, we divide them into train/test sets. Next, we convert the XML files into CSV files and then create TFRecords from these files. The training TFRecord file is used for model training. Finally, the values are fed into the model for evaluation.
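As an illustration of this step, the sketch below parses LabelImg-style Pascal VOC XML files into one CSV table; the directory layout, file names, and column names are our own assumptions rather than details from the paper.

```python
# Sketch (assumed layout): convert LabelImg Pascal VOC XML annotations to a CSV table.
import csv
import glob
import xml.etree.ElementTree as ET

def xml_to_rows(xml_dir):
    rows = []
    for xml_file in glob.glob(f"{xml_dir}/*.xml"):
        root = ET.parse(xml_file).getroot()
        filename = root.findtext("filename")
        width = int(root.findtext("size/width"))
        height = int(root.findtext("size/height"))
        for obj in root.findall("object"):
            box = obj.find("bndbox")
            rows.append([filename, width, height, obj.findtext("name"),
                         int(box.findtext("xmin")), int(box.findtext("ymin")),
                         int(box.findtext("xmax")), int(box.findtext("ymax"))])
    return rows

with open("train_labels.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["filename", "width", "height", "class",
                     "xmin", "ymin", "xmax", "ymax"])
    writer.writerows(xml_to_rows("annotations/train"))
```

Rows in this format are what the standard TFRecord generation scripts then consume.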

3.2.3. Extracting Features

After being labeled, the input images are saved in CSV format and converted into the TFRecord format in Tensorflow. We use the combination of the MobilenetV2 + SSD networks in Tensorflow to perform action identification and increase system accuracy.
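As a hedged sketch of using MobilenetV2 without its fully connected layer, as described earlier, the snippet below loads the Keras MobilenetV2 backbone with include_top=False so that only the convolutional feature extractor remains; the 224 × 224 input size is an assumption for illustration, not a value stated in the paper.

```python
import tensorflow as tf

# MobilenetV2 backbone without the fully connected (classification) layers,
# used purely as a feature extractor for the SSD detection layers.
backbone = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")

print(backbone.output.shape)  # (None, 7, 7, 1280) feature map for 224x224 inputs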

In the feature extraction, we use the MobilenetV2 network based on [6, 26–29]. The MobilenetV2 network uses depthwise separable convolutions. The blocks are constructed as shown in Figure 6.

First, the MobilenetV2 network uses pointwise convolution to expand the input channels. It then uses a depthwise convolution to extract the input features and a linear pointwise convolution to combine the output features while reducing the network size. After reducing the size, it replaces ReLU6 with a linear activation on the projection so that the output channel size matches the input, as shown in Figure 7.
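A minimal Keras sketch of one such inverted residual block, assuming the standard MobilenetV2 choices (expansion factor 6, 3 × 3 depthwise kernel), is shown below; it illustrates the expand/depthwise/linear-projection structure rather than reproducing our exact implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

def inverted_residual(x, out_channels, stride=1, expansion=6):
    """Expand -> depthwise -> linear projection, with a residual connection."""
    in_channels = x.shape[-1]
    # 1x1 pointwise convolution expands the number of channels.
    h = layers.Conv2D(expansion * in_channels, 1, padding="same", use_bias=False)(x)
    h = layers.BatchNormalization()(h)
    h = layers.ReLU(max_value=6.0)(h)  # ReLU6
    # 3x3 depthwise convolution filters each channel separately.
    h = layers.DepthwiseConv2D(3, strides=stride, padding="same", use_bias=False)(h)
    h = layers.BatchNormalization()(h)
    h = layers.ReLU(max_value=6.0)(h)
    # 1x1 linear projection back to a small number of channels (no ReLU6 here).
    h = layers.Conv2D(out_channels, 1, padding="same", use_bias=False)(h)
    h = layers.BatchNormalization()(h)
    # Residual (shortcut) connection only when the shapes match.
    if stride == 1 and in_channels == out_channels:
        h = layers.Add()([x, h])
    return h

# Example usage on an assumed 224x224x32 feature map.
inputs = tf.keras.Input(shape=(224, 224, 32))
outputs = inverted_residual(inputs, out_channels=32)
```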

The MobilenetV2 network also uses the inverted residual block, which combines the features passing over the shortcut connection with the features obtained by traversing the convolutions, gaining more useful information for the output. Depthwise convolution splits the input channels and filters into separate channels and then combines the outputs using a $1 \times 1$ convolution. We have a network input of size $D_F \times D_F \times M$, a kernel of size $D_K \times D_K$, and an output with $N$ channels. Depthwise convolution maps only over each individual input channel; therefore, the number of output channels and input channels is the same. Its computational cost is $D_K \cdot D_K \cdot M \cdot D_F \cdot D_F$, as shown in Figure 8 [7].

The final step is a pointwise ($1 \times 1$) convolution that combines the features created by the depthwise convolution. Its computational cost is $M \cdot N \cdot D_F \cdot D_F$, as shown in Figure 9, based on [7]. The total cost of the depthwise separable convolution is therefore $D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F$.

Performing the calculation with one filter per input channel, we infer the output feature map by the formula
$$\hat{G}_{k,l,m} = \sum_{i,j} \hat{K}_{i,j,m} \cdot F_{k+i-1,\, l+j-1,\, m},$$
where $\hat{K}$ is a depthwise kernel of size $D_K \times D_K \times M$, $F$ is the input feature map, and $\hat{G}$ is the output feature map.
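To make the cost comparison concrete, the sketch below evaluates the two formulas for one example layer; the layer sizes (D_K = 3, M = N = 96, D_F = 14) are illustrative assumptions, not values taken from the paper.

```python
# Multiply-accumulate cost of a standard convolution vs. a depthwise separable one.
def standard_conv_cost(d_k, m, n, d_f):
    return d_k * d_k * m * n * d_f * d_f

def depthwise_separable_cost(d_k, m, n, d_f):
    depthwise = d_k * d_k * m * d_f * d_f      # per-channel spatial filtering
    pointwise = m * n * d_f * d_f              # 1x1 convolution combining channels
    return depthwise + pointwise

d_k, m, n, d_f = 3, 96, 96, 14                 # assumed example layer
print(standard_conv_cost(d_k, m, n, d_f))        # 16257024
print(depthwise_separable_cost(d_k, m, n, d_f))  # 1975680, about 8x fewer operations
```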

3.2.4. Using SSD to Detect Objects

SSD [8, 21] is a good choice for object detection due to its greater accuracy than YOLO [9] and faster speed than Fast-RCNN [10]. SSD uses the VGG-16 base network with several additional layers for extracting feature maps. However, the purpose of our paper is to run on weaker devices such as mobile devices to reduce server-side bandwidth, reduce latency, and improve speed. As a result, the system reduces the cost of mobile traffic for users, since large amounts of raw data do not have to be transferred to a computer. Therefore, we propose to use the MobilenetV2 network instead of the VGG-16 base network to extract feature maps. The SSD adds auxiliary layers after MobilenetV2 to predict the objects.

For each default box, the SSD model creates a vector of probabilities of an object's occurrence, $(c_1, c_2, \ldots, c_p, c_0)$, where $p$ is the number of classes and the background class $c_0$ indicates that there is no object. A vector with four elements represents the position of the object in the frame.
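The following sketch shows, under our own simplifying assumptions, how one SSD prediction head produces exactly these outputs from a feature map: (p + 1) class scores and 4 position offsets for each of the k default boxes at every location.

```python
import tensorflow as tf
from tensorflow.keras import layers

def ssd_head(feature_map, num_classes, k):
    """One SSD prediction head: class scores and box offsets per default box."""
    num_cls = num_classes + 1  # +1 for the background class
    cls = layers.Conv2D(k * num_cls, 3, padding="same")(feature_map)
    loc = layers.Conv2D(k * 4, 3, padding="same")(feature_map)
    # Reshape to (batch, boxes_on_this_map, ...) for the loss computation.
    cls = layers.Reshape((-1, num_cls))(cls)
    loc = layers.Reshape((-1, 4))(loc)
    return cls, loc

# Example usage on a hypothetical 10x10 feature map with 6 default boxes per cell.
fmap = tf.keras.Input(shape=(10, 10, 256))
cls_out, loc_out = ssd_head(fmap, num_classes=11, k=6)  # 11 gesture classes assumed
```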

After each training step, we calculate the loss function and adjust the predictions so that they are closer to the real object. The model converges when the difference between ground truth and predictions is close to zero, as shown in Figure 10, based on [8, 21].

The loss function is calculated as follows [8]:
$$L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right).$$

The loss function consists of two terms, $L_{conf}$ and $L_{loc}$, where $N$ is the number of matched default boxes. $L_{conf}$ is the confidence loss, that is, the softmax loss over the multiple class confidences $(c)$, and $\alpha$ is set to 1 by cross validation. $x_{ij}^{p} \in \{0, 1\}$ is an indicator for matching the $i$-th default box to the $j$-th ground truth box of category $p$.
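A simplified sketch of this loss under our own assumptions (labels already matched to default boxes, label 0 meaning background, and no hard negative mining) is given below; it is illustrative rather than the full SSD training loss.

```python
import tensorflow as tf

def ssd_loss(cls_true, cls_pred, loc_true, loc_pred, alpha=1.0):
    """Simplified SSD loss: softmax confidence loss plus smooth L1 localization loss.

    cls_true: (batch, boxes) integer labels, 0 = background.
    cls_pred: (batch, boxes, classes) raw logits.
    loc_true, loc_pred: (batch, boxes, 4) encoded box offsets.
    """
    positive = tf.cast(cls_true > 0, tf.float32)           # x_ij matching indicator
    n_matched = tf.maximum(tf.reduce_sum(positive), 1.0)   # N in the formula

    # Confidence loss over all boxes (the full SSD keeps only hard negatives).
    conf = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=cls_true, logits=cls_pred)
    conf_loss = tf.reduce_sum(conf)

    # Localization loss (smooth L1 / Huber) counted only for matched boxes.
    huber = tf.keras.losses.Huber(reduction="none")
    loc_loss = tf.reduce_sum(huber(loc_true, loc_pred) * positive)

    return (conf_loss + alpha * loc_loss) / n_matched
```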

From the training process, we have an algorithm diagram in which input images pass through the MobilenetV2 network to obtain their feature weights. The data are then fed into the SSD network to determine the coordinates and probability of the object's appearance as well as the loss function value, as shown in Figure 11.

3.2.5. Converting the Model to Tensorflow Lite (TFLite) Format

TFLite is a lightweight Tensorflow solution for mobile and embedded devices. It allows running machine learning models on mobile devices. The process for this model is shown in Figure 12, based on [7].
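As a hedged illustration of this conversion, the snippet below uses the current tf.lite.TFLiteConverter API on an assumed SavedModel export directory; our original pipeline converted a frozen graph, so the path and API choice here are assumptions for illustration.

```python
import tensorflow as tf

# Convert an exported detection model (assumed SavedModel path) to TFLite format.
converter = tf.lite.TFLiteConverter.from_saved_model("exported_model/saved_model")
tflite_model = converter.convert()

with open("detect.tflite", "wb") as f:
    f.write(tflite_model)
```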

The main components of Tensorflow Lite are the model file format, the interpreter for graph processing, a set of kernels to work with, and finally the interface to the hardware acceleration layer:
(i) Model file format: Tensorflow Lite uses a special model file format. It is very light and less dependent on the hardware configuration. Most graph calculations are done using 32-bit floats.
(ii) The interpreter: it is designed to operate at low cost and on simply configured devices. Tensorflow Lite has very few dependencies and is easy to build on simple devices. It uses FlatBuffers, so models can be loaded quickly with little overhead.
(iii) Ops/kernels: Tensorflow Lite supports a smaller set of operators, so not all models are supported. Tensorflow Lite provides an integrated set of core ops optimized for the CPU using NEON. They operate in both float and quantized modes.
(iv) Hardware acceleration: it targets custom hardware through the Neural Networks API. Tensorflow Lite comes preloaded with bindings for the Neural Networks API (NN API). If the device supports the NN API, the data flow will delegate these operators to the API; otherwise, it will execute directly on the CPU.

3.2.6. Applying the Proposed Algorithm for Mobile

Tensorflow Mobile is used for mobile platforms such as iOS and Android. It helps developers who have a successful Tensorflow model and want to integrate their model into a mobile environment. However, the fundamental challenges in integrating a model into the mobile environment are
(i) Using the Tensorflow Mobile library
(ii) Building the model for a mobile platform
(iii) Adding Tensorflow libraries to the mobile application
(iv) Preparing the model file
(v) Optimizing binary size, file size, RAM usage, etc.

Tensorflow Lite is the follow-up to Tensorflow Mobile. It can run on most mobile devices at a very fast speed. Tensorflow Lite is a set of tools that helps developers run Tensorflow models on mobile, embedded, and IoT devices. It allows machine learning inference on the device with low latency and a small binary size.

Tensorflow Lite includes two main components:
(i) The Tensorflow Lite interpreter runs specially optimized models on a variety of hardware such as mobile phones, embedded Linux devices, and microcontrollers (a usage sketch is given after this list).
(ii) The Tensorflow Lite converter converts Tensorflow models into an efficient form for use by the interpreter and can apply optimizations to improve binary size and performance.
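The sketch below runs a converted model through the Python tf.lite.Interpreter to check it on a desktop before deployment; the model path, input size, and normalization are assumptions for illustration.

```python
import numpy as np
import tensorflow as tf

# Load the converted model and allocate its tensors.
interpreter = tf.lite.Interpreter(model_path="detect.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Assumed input: one 300x300 RGB frame with float values in [0, 1].
frame = np.random.rand(1, 300, 300, 3).astype(np.float32)
interpreter.set_tensor(input_details[0]["index"], frame)
interpreter.invoke()

# First output tensor (typically bounding boxes for a detection model).
boxes = interpreter.get_tensor(output_details[0]["index"])
print(boxes.shape)
```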

Tensorflow Lite is functionally different from Tensorflow Mobile in that it is optimized to support model conversion and deployment. Tensorflow Lite has been leveraged at every level, from model conversion to hardware utilization, to increase the viability of on-device inference while maintaining model integrity, as follows:
(i) Model transformation: the Tensorflow Lite (TOCO) converter takes a trained Tensorflow model as input and outputs a FlatBuffer-based TFLite (.tflite) file containing the binary representation of the original model.
(ii) The interpreter core is responsible for executing Lite models in client applications using a set of Tensorflow operators. By limiting the default operators, libraries, and tools needed to run Lite models, the interpreter core has been reduced to about 100 KB, or 300 KB with the full operator set.
(iii) Hardware speed-up: through its optimizations, Tensorflow Lite reaches all hardware relevant to the operation of mobile and embedded devices.
(iv) Quantization is an important optimization for the neural networks related to Tensorflow Lite. Post-training quantization is recommended in Tensorflow Lite and is provided as an attribute of the TOCO converter (see the sketch after this list). The results have shown that the inference delay of the compressed model can be reduced by up to 3 times while maintaining accuracy.
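A hedged sketch of post-training quantization with the current converter API is shown below; the TOCO-specific flags mentioned above are replaced here by the tf.lite.Optimize.DEFAULT option, and the model path is an assumption.

```python
import tensorflow as tf

# Post-training quantization: weights are stored in 8-bit form, which usually
# shrinks the .tflite file and reduces inference latency.
converter = tf.lite.TFLiteConverter.from_saved_model("exported_model/saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
quantized_model = converter.convert()

with open("detect_quant.tflite", "wb") as f:
    f.write(quantized_model)
```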

To deploy the Tensorflow Lite model file in our application, we build a system of three main components, as shown in Figure 13:
(i) Java API: wraps the C++ API functions on Android
(ii) C++ API: loads the Lite model and calls the interpreter
(iii) Interpreter: uses selective kernel loading, a unique feature of Lite in Tensorflow

4. Simulation and Result

4.1. Setup

During implementation, to increase accuracy, the input data are passed through a preprocessing step to improve quality. After this step, the data are transferred to Mobilenet, and parameters such as batch size, learning rate, and multibox detection are adjusted to match the input data as well as the computer configuration, improving the accuracy and speeding up the training process.

In our simulation, we set up the parameters as follows [21].

The batch size is varied from 1 to 8. The learning rate decay policy is slightly different for each dataset and object, and the initial learning rate is set as in [21]. Among the three parameters, multibox detection is the most important. For each location, we have k bounding boxes of different sizes and aspect ratios. In our paper, we have 8732 bounding boxes with aspect ratios 1, 2, 3, 1/2, and 1/3.

Each training image is randomly sampled, either by using the entire original input image or by sampling a patch.

The minimum Jaccard overlap of a sampled patch with the objects is 0.1, 0.3, 0.5, 0.7, or 0.9.

The size of each sampled patch is within [0.1, 1] of the original image size, and the aspect ratio is between 1/2 and 2.
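For reference, the 8732 default boxes mentioned above match the standard SSD300 layout; the short sketch below recomputes the total from the per-layer feature map sizes and boxes per location of that layout, which is an assumption on our part since the paper does not list them.

```python
# Standard SSD300 layout: (feature map size, default boxes per location).
feature_layers = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]

total_boxes = sum(size * size * boxes for size, boxes in feature_layers)
print(total_boxes)  # 8732
```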

4.2. Result

To perform the training with the action sets mentioned above, we obtain the results shown in Figures 14 and 15. Figures 14 and 15 show the training on a CPU (Core i5) with 8 GB of RAM. Training runs for six days, with the steps reducing the loss function from 29 to 2.

We evaluate four models, namely, Tensorflow (RCNN + InceptionV2), Tensorflow (RFCN + Resnet101), Tensorflow Lite, and the proposed model (SSD + MobilenetV2). The results are shown in Figures 16–19.

We perform identification of the above set of actions with Tensorflow (RCNN + InceptionV2). The results of the operation are shown in Figures 16 and 20.

We continue to implement identification of the set of actions with Tensorflow (RFCN + Resnet101). The results of the operation are as shown in Figure 18.

We continue to implement identification of the set of actions with Tensorflow Lite. The results of the operation are as shown in Figure 17.

We perform identification of the set of actions with the proposed model. The results of the operation are shown in Figure 19.

We also check the actions against different image backgrounds. The results are shown in Tables 3 and 4.

From the above results, we see that the system meets the requirements with an accuracy of over 90%. In particular, with Tensorflow and Tensorflow Lite, the system achieved an accuracy of up to 99% with an execution time of 14 seconds. This is an acceptable time for an intelligent control system.

The gesture and action recognition system is built with the SSD + MobilenetV2 algorithm and trained on over 2500 images. We then use 10 different images for each gesture and action. The results show that the system is feasible with an accuracy of over 98%. After freezing the graph and converting, the precision is 82% with Tensorflow and 82% with Tensorflow Lite.

We recorded video with the Tensorflow and Tensorflow Lite models on an i5 computer with 8 GB of RAM. The memory and CPU usage for each model with the proposed dataset are shown in Table 5. The results show that although the proposed method (SSD + MobilenetV2) has lower accuracy, its processing speed is 23.6 times faster than RCNN + InceptionV2 and 37.8 times faster than RFCN + Resnet101.

After moving to the Tensorflow Lite format, we created an application to evaluate the real-time performance of the system. To evaluate the proposed algorithm with real video, we use input data at 30 frames/second with a bit rate of 82 kbps, which is suitable for real-time applications. The results, shown in Figures 21 and 22, indicate that the proposed model is suitable for real devices with an accuracy of up to 99%.

We also compare our model with [30–32]. The results show that the accuracy of the proposed method is better than that of [30–32], as shown in Figure 23. The accuracy of the proposed system reaches 98% with the Tensorflow model.

The training process is difficult for the computer when the amount of calculation is huge. A simple 2D ConvNet for classifying 101 classes has about 5 million parameters, while the same architecture structured in 3D has about 33 million parameters. It takes 3 to 4 days to train a 3D ConvNet on UCF101 and about two months on Sports-1M [33]. This makes searching for an extended architecture difficult and time-consuming when using an i5 CPU configuration with only 8 GB of RAM. The results comparing the computational efficiency of the model with other networks are shown in Table 6.

In Table 6, the accuracy of our proposal is not high (about 82%). However, the model using Resnet-101 requires 931 MB (megabytes) and 2.791 seconds per gesture, while our proposal uses only 19 MB and 0.07 seconds per gesture. Therefore, our proposed model is smaller by more than ten times, and its execution is about 40 times faster (from 28 to 36 frames per gesture) compared with Resnet-101. The execution speed of a model usually depends on the number of parameters of the model. However, it also depends on the computational complexity, which is determined by its architecture. By improving the architecture of the model, we reduce both its computational complexity and its execution time.

5. Conclusion

The paper focuses on the use of neural networks in identifying human actions. In this paper, we have identified actions with an accuracy of over 90%. However, the system still has disadvantages: the recognition accuracy for some actions is not high, and the frame rate per second is still low. Therefore, in future work, we will increase the frame rate per second, improve accuracy by increasing the resolution of the input image or using the preprocessing method used in previous work [34, 35], and combine neural networks with other networks to increase computational efficiency and performance for any object.

Data Availability

The data used to support the findings of this study include a self-built data source, an online source via Google, and parts of UCF101 [14], BU203 [15], and HMDB51 with eight actions, namely, walking, sitting down, falling back, putting on shoes, waving hands, falling down, smoking, and baby crawling.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research was carried out in the framework of the project funded by the Ministry of Education and Training (MOET), Vietnam, under Grant B2020-BKA-06. The authors would like to thank MOET for their financial support.