Abstract

In recent years, research on human activity recognition (HAR) has played a significant role in healthcare systems. Accurate activity classification from HAR enhances the performance of healthcare systems across a broad range of applications. HAR results are useful for monitoring a person’s health, and the system predicts abnormal activities based on user movements. The abnormal activity predictions of a HAR system enable better healthcare monitoring and reduce users’ health risks. Conventional HAR systems use wearable sensors, such as inertial measurement units (IMUs) and stretch sensors, for activity recognition. These approaches show remarkable performance on basic activities such as sitting, standing, and walking. However, when the user performs complex activities, such as running, jumping, and lying, sensor-based HAR systems produce a higher degree of misclassification due to sensor reading errors. These sensor errors degrade the overall performance of the HAR system. Similarly, radiofrequency- or vision-based HAR systems are not free from classification errors when used in real time. In this paper, we address some of the existing challenges of HAR systems by proposing a human image threshing (HIT) machine-based HAR system that uses an image dataset from a smartphone camera for activity recognition. The HIT machine effectively uses a mask region-based convolutional neural network (R-CNN) for human body detection, a facial image threshing machine (FIT) for image cropping and resizing, and a deep learning model for activity classification. We demonstrate the effectiveness of our proposed HIT machine-based HAR system through extensive experiments and results. The proposed HIT machine achieved 98.53% accuracy when the ResNet architecture was used as its deep learning model.

1. Introduction

Healthcare systems play a vital role in our daily lives. With today’s busy lifestyles, lack of exercise causes serious health issues. Emerging technologies such as human activity recognition (HAR) systems [1] can monitor users’ activities in a healthcare system. Recent research trends in HAR show a wide variety of applications that include health and fitness monitoring [2], assisted living [3], context-enabled games and entertainment [4], social networking [5], and sports tracking [6]. In HAR, the system tracks the user’s movements and classifies the user’s activities based on the sensor readings. Existing HAR systems include vision-based [7], radiofrequency-based [8], and wearable sensor-based approaches [9]. The most common and lowest-cost HAR technique is the wearable sensor-based approach. The sensor-based technique is location independent, and the user can easily carry the sensor during their activities. Sensor-based HAR approaches have achieved remarkable classification accuracy, and smartphone- or smartwatch-based HAR is the most common system used for activity recognition. However, sensor errors, the sensor type, the sensor position on the human body, and the user’s complex activities make activity recognition more challenging. The HAR system produces its worst classification results when the user performs complex motions. On the other hand, when the HAR system uses radio frequency (RF) signals for activity recognition, it takes advantage of wireless communication features to classify the user’s activities. Compared with the sensor-based HAR approach, RF-based HAR is device-free, and the system does not need any physical sensing module. The device-free characteristic of radio frequency-based HAR reduces energy consumption and improves privacy protection compared with sensor- or vision-based HAR systems. However, indoor channel conditions, non-line-of-sight conditions, and signal interference affect the performance of HAR, and the system faces difficulties in maintaining high accuracy levels. Besides these HAR approaches, the vision-based HAR system uses a camera that records the user’s activities as a video sequence. The vision-based approach uses computer vision algorithms for activity recognition. Depending on the camera type used in the HAR system, the video sequence from the vision approach takes the form of RGB videos [10], depth videos [11], or RGB-D videos [12]. Compared with sensor-based or radio frequency-based HAR approaches, the vision-based approach shows higher classification results for users’ complex activities. However, user privacy, energy consumption, and deployment cost are the main challenges for vision-based HAR approaches. In this paper, our research focuses on the vision-based HAR approach, and we propose a human image threshing (HIT) machine-based HAR system that addresses some of the existing vision-based HAR challenges. Our HIT machine-based HAR system uses a smartphone camera as an input device to record the users’ activities. The recorded activity videos are further processed by a mask region-based convolutional neural network (R-CNN) for human body detection, a facial image threshing machine (FIT) for image cropping and resizing [13], and a deep learning model for activity recognition. Our HIT machine can generate HAR images from activity videos, detect the human body in images, clean the data and remove irrelevant samples, and classify activities using a deep learning model.
We tested our HIT machine with different HAR experiments based on deep learning models, including the visual geometry group (VGG) [14], Inception [15], ResNet [16], and EfficientNet [17] models. The results from the HIT machine show that the system consistently maintains its classification accuracy for activity recognition. We compared our HIT machine results with conventional HAR approaches, including inertial measurement unit (IMU)- and stretch sensor-based approaches. The results show that the HIT machine outperforms the traditional sensor-based approaches with a higher level of accuracy for activity recognition. We also tested our pre-trained deep learning models with unseen HAR datasets and analyzed the classification performance. The key contributions of our HIT machine are as follows:
(i) We created a HAR dataset using a smartphone camera, an IMU sensor, and a stretch sensor. Our dataset consists of nine activities: sitting, standing, lying, walking, push up, dancing, sit-up, running, and jumping. It has 36,558 image samples from smartphone cameras, 97,454 data samples from IMU sensors, and 7,850 data samples from stretch sensors. We used these datasets to validate our HIT machine, and deep learning models can use our HAR datasets for training and testing without undue computational complexity. We also collected unseen HAR datasets and tested them with the pre-trained deep learning models.
(ii) We proposed a HIT machine for activity recognition, and our HIT machine shows accurate classification results for basic (sitting, standing, and walking) and complex (running, jumping, and lying) activities. We tested our HIT machine with different deep learning models and analyzed the classification performance in terms of the confusion matrix, accuracy, loss, precision, recall, and F1 score. We also tested the pre-trained models with unseen HAR datasets and compared the performance of each model. We validated our HIT machine results against sensor-based HAR results and demonstrated the impact of the HIT machine on activity recognition.

The rest of the paper is organized as follows: Section 2 discusses the existing HAR systems, recently proposed HAR systems with their advantages, and current HAR challenges for practical implementation. Section 3 presents our proposed HIT machine-based HAR system, including mask R-CNN, FIT machine, and deep learning models. Section 4 discusses our HAR experiments with the validation of our HIT machine in terms of the impact of various deep learning models, analysis of unseen datasets for pre-trained models, and the result comparison with conventional HAR approaches. Finally, Section 5 concludes our HIT machine-based HAR approach with future research directions.

2. Related Work

HAR has been studied for applications in healthcare monitoring, smart homes, security, medical imaging, robot/human interaction, personal assistants, and surveillance [18–20]. Many researchers have discussed various HAR approaches based on the technologies or algorithms used for activity recognition [21–25]. In this paper, our literature review focuses on HAR approaches based on sensors [26, 27], Wi-Fi [28], Wi-Fi combined with sensors [29], vision [30, 31], and RFID [32]. The HAR approaches in [26–32] provide significant performance improvements for HAR applications. However, the diversity of age, gender, and number of subjects, postural transitions, the number and type of sensors, different body locations of wearable sensors or smartphones, missing values or labeling errors, similar postures and datasets containing complex activities, lack of ground truth, the selection of appropriate datasets, and the selection of sensors [33, 34] all pose challenges to HAR implementation. This paper proposes a HIT machine-based HAR system to address some of these challenges with higher classification results.

Sensor-based HAR approaches are the most common and popular HAR systems. In sensor-based HAR, the system uses wearable sensors, smartphones, or smartwatches to collect data and identifies the user’s activity based on the sensor readings. Some recent HAR systems that take advantage of wearable sensors are discussed in [35–39]. These systems achieved remarkable recognition accuracy in real time. However, mounting a wearable sensor on the human body is challenging, and the sensor’s position determines the system’s performance. Wearable sensor-based HAR systems still need to optimize sensor placement on the human body for complex activities. An alternative method for activity recognition is the smartphone-based HAR system [40–43]. In smartphone-based HAR, the user holds the smartphone while performing activities. Compared with wearable sensor-based approaches, the smartphone-based method is simple and easy to deploy anywhere without external sensors. However, the position in which the smartphone is held and usage modes such as texting and calling affect the system’s performance. Smartphone- or wearable sensor-based HAR approaches still need to improve classification performance, and current systems use deep learning models for activity recognition [44–47]. Deep learning-based HAR systems include convolutional neural networks (CNN) [48], long short-term memory (LSTM) [49], LSTM-CNN [50], deep recurrent neural networks (DRNN) [51], generative adversarial networks (GAN) [52], extreme learning machines (ELM) [53], graph neural networks (GNN) [54], and semi-supervised deep learning models [55, 56]. These systems use raw sensor readings or extract signal features in the time or frequency domain for activity recognition, as illustrated in the sketch below. When the system uses the signal in the time domain, it extracts the variance, mean, maximum, minimum, and range values and uses these features as model inputs. On the other hand, if the signal is in the frequency domain, the system extracts the amplitude, skewness, kurtosis, and energy information as features and feeds them to the model. Compared with raw-signal-based deep learning HAR approaches, feature-based approaches show better classification results [2]. However, deep learning-based HAR approaches are not free from challenges. The large number of data samples required for training, the training time, the complexity of feature extraction, and the human resources required for data collection are some of the main challenges of deep learning-based HAR approaches. These challenges reduce system performance and call for further classification improvements.
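To make this feature extraction step concrete, the following minimal Python sketch computes the time- and frequency-domain features listed above for a single sensor window; the window length and library choices are our own illustrative assumptions, not the cited systems’ implementations.

```python
# Illustrative sketch of time/frequency-domain HAR features (assumed, not from [44-47]).
import numpy as np
from scipy.stats import skew, kurtosis

def time_domain_features(window):
    # variance, mean, maximum, minimum, and range of the raw signal window
    return {
        "mean": np.mean(window),
        "variance": np.var(window),
        "max": np.max(window),
        "min": np.min(window),
        "range": np.ptp(window),
    }

def frequency_domain_features(window):
    spectrum = np.abs(np.fft.rfft(window))  # amplitude spectrum of the window
    return {
        "amplitude": spectrum.max(),
        "skewness": skew(spectrum),
        "kurtosis": kurtosis(spectrum),
        "energy": np.sum(spectrum ** 2) / len(spectrum),
    }

window = np.random.randn(128)  # stand-in for a 128-sample accelerometer window
features = {**time_domain_features(window), **frequency_domain_features(window)}
```

The resulting feature dictionary would then be flattened into the input vector of the chosen deep learning model.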

RF-based HAR approaches use physical sensors, such as pressure, proximity, FM radio, microwave, or RFID, for activity recognition [57–61]. In a radio frequency-based approach, the system takes advantage of body attenuation and channel fading characteristics for activity recognition. The basic principle of RF-based HAR systems is that the propagation of RF signals is affected by human body movement, resulting in attenuation, refraction, diffraction, reflection, and multipath effects. The resulting pattern differences in the received RF signals are the key to activity recognition. Different activities lead to different patterns in the RF signals, and the system can use these features for classification. RF-based systems consist of signal selection, modeling, signal processing, segmentation, feature extraction, and activity classification. In signal selection, the system uses Wi-Fi, ZigBee [63], RFID [64], frequency-modulated continuous-wave (FMCW) radar, or acoustic devices. Depending on the signal selection, the system uses phase, frequency, amplitude, or the raw signal for activity recognition. These factors determine the model of the HAR system. Once the model is defined, the system applies signal processing techniques, including noise reduction, calibration, and redundancy removal. After this, the system segments the signal in the time or frequency domain. Once segmentation is performed, time-domain, frequency-domain, time-frequency-domain, or spatial-domain features are extracted for classification. Deep learning models use the extracted features for activity recognition. Compared with the wearable sensor-based HAR approach, the RF-based approach exploits wireless communication features for activity recognition. These systems do not use any physical sensing module, thus reducing energy consumption and user privacy concerns. Some RF-based HAR approaches are discussed in [65–68]. The RF-based systems discussed here have enhanced HAR classification performance and opened up many applications for detection, recognition, estimation, and tracking. However, wireless channel conditions, signal interference, non-line-of-sight (NLOS) conditions, multi-user activity sensing, and limited sensing range make these systems more challenging. They require new theoretical models and open datasets for accurate classification.

When considering a vision-based HAR approach, the system uses a video sequence for activity monitoring [69, 70]. The vision-based approach is best suited for multi-user activity recognition when privacy is not a significant concern. These systems apply computer vision algorithms to activity videos to predict the user’s activities from videos or images. Some vision-based HAR approaches are proposed in [71–77]. These systems effectively use video or image sequences and classify the users’ activities by taking advantage of recent deep learning models. Several review papers on vision-based HAR systems are available [78–80]. Among these reviews, the authors of [81] focus on high-level visual processing, including human body modeling, understanding human actions, and approaches to human action recognition. In [82], the authors presented the state-of-the-art development of automated visual surveillance systems and discussed the necessity of intelligent visual surveillance in commercial, law enforcement, and military applications. In [83], the paper reviews advances in human motion capture and analysis from 2000 to 2006 and discusses the problems future research must solve to achieve automatic visual analysis of human movement. The review paper [84] analyzes the approaches taken to date within the computer vision, robotics, and artificial intelligence communities to represent, recognize, synthesize, and understand action, paying particular attention to identifying actions at different levels of complexity. Machine recognition of human activities is reviewed in [85], where the authors present a comprehensive survey of efforts to address vision-based HAR. The paper [80] focuses on pedestrian detection, and [86] introduces a HAR system that recognizes human behaviors in transit scenes. The most recent HAR systems are presented in [87–89]. These systems improve feature extraction techniques by introducing object detection, skeleton tracking, and human body poses. The vision-based HAR systems discussed here still face challenges, such as processing high-quality videos or images, the complexity of the vision algorithms, the requirement for greater graphics processing unit (GPU) power, the installation cost of cameras, and vision-specific issues such as camera viewpoint, lighting, human body appearance, occlusion, and background clutter. These challenges make vision-based approaches more difficult to use for real-time health monitoring.

So far, we have discussed different types of HAR approaches based on their technologies and algorithms used for activity recognition. In this paper, our research mainly focuses on the vision-based HAR approach, and we used our smartphones for data collection. We also collected data using IMU and stretch sensors, and the results from these sensors are compared with our proposed HIT machine. The experiment results show that the HIT machine is a practical HAR approach for healthcare applications and needs only a basic smartphone model for activity recognition.

3. Proposed HIT Machine-Based HAR System

The HIT machine consists of HAR dataset creation, data preprocessing, human body detection using mask R-CNN, image cropping and resizing, data cleaning and removal of irrelevant data, deep feature extraction, model building, and activity classification. Figure 1 shows the framework of our proposed HIT machine-based HAR system.

We first started data collection in the HIT machine by using Android and iOS smartphones to record activity videos. Next, the HIT machine performs data aggregation on the activity video sequences. The data aggregation gathers all activity data and presents it in a summarized format. Following the data aggregation process, our system uses a mask R-CNN algorithm for human body detection. Once the human body is identified in the images, the HIT machine operates the FIT machine for image cropping and resizing. The cropped and resized activity images are then ready for the model to use for training and testing. Our HIT machine also uses a data cleaning process that removes unnecessary images from the HAR dataset. After the data cleaning process, the images are ready for model training and testing. We extracted features from the activity images and created a deep learning model that classifies user activities into nine groups. The output of the HIT machine is the classification result of user activities, which include sitting, standing, lying, walking, push up, dancing, sit-up, running, and jumping. The flowchart of the proposed HIT machine is presented in Figure 2.

In the flowchart, the system starts with the HAR datasets. The datasets include HAR images from smartphones, accelerometer and gyroscope readings from IMU sensors, and stretch sensor readings. The HAR image dataset is then divided into training, testing, and unseen datasets. We used our HIT machine on the HAR image dataset for human body detection and activity recognition. The HIT machine includes human body detection, data preprocessing using a FIT machine, and deep learning models for classification. A mask R-CNN-based object detection algorithm is used for human body detection. A FIT machine is used for data preprocessing, including image cropping, resizing, data cleaning, and data segregation. A deep learning model is used for training, and the model classifies the user activities into different categories. The system uses the VGG, Inception, ResNet, and EfficientNet deep learning models. On the other hand, the conventional HAR approaches use IMU and stretch sensor data for activity recognition with a CNN model. The CNN model also uses the HAR image dataset for activity recognition, and we compared the effect of our HIT machine (with and without the HIT machine) on activity recognition. Further discussion of the mask R-CNN, the FIT machine operation, and the deep learning models is provided in the following subsections.
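As an illustration of the first pipeline step, converting activity videos into image samples, the following OpenCV sketch shows one plausible implementation; the frame sampling rate, file names, and directory layout are assumptions for illustration rather than the HIT machine’s exact code.

```python
# Hedged sketch of the video-to-frames step (assumed implementation, not the HIT machine code).
import cv2
import os

def video_to_frames(video_path, out_dir, every_n=5):
    """Save every n-th frame of an activity video as a JPEG image."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx, saved = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:                       # end of video
            break
        if idx % every_n == 0:           # subsample to reduce near-duplicate frames
            cv2.imwrite(os.path.join(out_dir, f"frame_{saved:05d}.jpg"), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved

# e.g., video_to_frames("walking_user1.mp4", "dataset/walking")
```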

3.1. Mask R-CNN

In computer vision, mask R-CNN is widely used for object detection tasks [90]. The mask R-CNN separates different objects in a video or an image. The algorithm provides the object bounding boxes, classes, and mask information, and our HIT machine can effectively utilize this information for human body detection. The mask R-CNN in our HIT machine operates in two stages. First, the algorithm generates proposals for regions where an object may be located in the input image. Second, the algorithm predicts the object class and refines the bounding box. The algorithm also adds a mask at the pixel level of the object based on the first-stage proposals. Compared with Fast/Faster R-CNN-based object detection approaches, the mask R-CNN-based approach has an additional feature: a binary mask for each region of interest (RoI). Our system utilizes this binary mask feature for human body detection. Figure 3 shows the structure of mask R-CNN.

The mask R-CNN consists of a backbone, a region proposal network (RPN), a region of interest alignment layer (RoIAlign), an object detection head, and a mask generation head. The backbone of mask R-CNN is the primary feature extractor, which uses residual networks (ResNets) with or without feature pyramid networks [91]. When our HAR images are fed into a ResNet backbone, the images pass through multiple residual bottleneck blocks and are turned into a feature map. The feature map contains the abstract information of the input images, including the different object instances, classes, and spatial properties. The feature map data are then fed into the RPN layer. In this layer, the network scans the feature map and proposes RoIs where the human body may be located. The next step is to extract each RoI from the feature map. This process is referred to as RoIAlign in Figure 3. RoIAlign extracts the feature vectors from the feature map based on the RoIs suggested by the RPN layer. The feature vectors are then converted into a fixed-size tensor for further processing. The outputs from RoIAlign are then processed by two parallel branches: an object detection branch and a mask generation branch. The object detection branch is a fully-connected layer that maps the feature vectors to the final classes and bounding box coordinates. The mask generation branch feeds the feature map into a transposed convolutional layer and a convolutional layer. The mask generation branch outputs one binary segmentation mask per class, and the system then picks the output mask based on the class prediction from the object detection branch. Figure 4 shows the human body detection results of our HIT machine for nine activities.

As shown in Figure 4, the mask R-CNN accurately detects the human body for all nine activities without any detection error. The mask R-CNN used here is straightforward and has a small computational overhead, which enables a fast system and rapid experimentation. For more details on mask R-CNN and its implementation, refer to [92–94].
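As a hedged illustration of this stage, the sketch below uses torchvision’s COCO-pretrained Mask R-CNN and keeps only “person” detections; our actual implementation details (framework, weights, and score threshold) may differ.

```python
# Minimal sketch of human body detection with an off-the-shelf Mask R-CNN
# (torchvision, COCO weights); threshold and I/O choices are illustrative.
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
model.eval()

def detect_person(image_path, score_thresh=0.8):
    img = to_tensor(Image.open(image_path).convert("RGB"))
    with torch.no_grad():
        out = model([img])[0]  # dict with 'boxes', 'labels', 'scores', 'masks'
    # COCO class index 1 corresponds to 'person'
    keep = (out["labels"] == 1) & (out["scores"] > score_thresh)
    return out["boxes"][keep], out["masks"][keep]

# boxes, masks = detect_person("frame_00001.jpg")
```

The returned binary masks and boxes correspond to the RoI outputs described above and can be handed to the FIT machine for cropping.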

3.2. FIT Machine

The HIT machine effectively uses our previously proposed FIT machine for image cropping and resizing [13]. The FIT machine is used to correct missing HAR datasets, remove irrelevant data, merge datasets on a massive scale, and crop and resize images. Our FIT machine converts the input activity video sequences into output image samples consisting of cropped, resized, and categorized activity images. The FIT machine contains a data receiver, a multi-task cascaded convolutional network (MTCNN), an image resizer [95], and a data segregator based on a pre-trained Xception model [96]. The data receiver converts activity video sequences into images, and the MTCNN identifies the human faces in the activity images. The MTCNN used here consists of P-Net, R-Net, and O-Net layers. To detect human faces, the input images first enter the P-Net layer, which chooses the possible face frames from the input images. The R-Net layer in the MTCNN uses the P-Net outputs as its inputs; it inspects the initial frames given by the P-Net and removes the face frames that do not reach a threshold score. Finally, the O-Net layer selects the best face frames from the R-Net output. Next, the images are passed through an image resizer, which reduces the image size to 224 × 224 pixels. The last part of the FIT machine is a data segregator, which segregates the activity images into adequately labeled directories. The data segregator contains a pre-trained Xception model built from depth-wise separable convolution layers. A depth-wise separable convolution layer splits the input and filter by channel, convolves each channel separately, and then combines the per-channel outputs with a point-wise convolution. The architecture also has shortcut structures that skip over blocks of the depth-wise separable convolution layers. The model uses a categorical cross-entropy loss function as its loss metric. For more details on the FIT machine, refer to [13].
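The following sketch illustrates the MTCNN-plus-resizer portion of this stage using the open-source `mtcnn` package; the confidence threshold and the decision to drop frames without a detected face are illustrative assumptions, and the actual FIT machine in [13] may differ.

```python
# Hedged sketch of MTCNN face detection followed by 224 x 224 resizing
# (assumed pipeline; thresholds and cleanup policy are illustrative).
import cv2
from mtcnn import MTCNN

detector = MTCNN()  # runs the P-Net, R-Net, O-Net cascade internally

def fit_preprocess(image_path, size=(224, 224), min_conf=0.95):
    img = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2RGB)
    faces = detector.detect_faces(img)  # list of {'box', 'confidence', ...}
    if not faces or faces[0]["confidence"] < min_conf:
        return None                      # data cleaning: drop images without a face
    return cv2.resize(img, size)         # resized activity image for the classifier
```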

3.3. Deep Learning Models

The last stage of the HIT machine is the deep learning models. Our HAR dataset is used to train deep learning models that classify user activities into sitting, standing, lying, walking, push up, dancing, sit-up, running, and jumping. The HAR dataset consists of image samples, and our system considers four image classification models, VGG, ResNet, Inception, and EfficientNet, as the deep learning models. Figure 5 shows the deep learning models used by our HIT machine.
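Before describing each model, the following hedged Keras sketch shows how any of the four backbones could be attached to a nine-class activity head; the optimizer, pooling, and input size are illustrative choices, not our exact training configuration.

```python
# Illustrative transfer-learning setup for the nine HAR classes
# (hyperparameters assumed, not the paper's exact configuration).
import tensorflow as tf

BACKBONES = {
    "vgg": tf.keras.applications.VGG16,
    "inception": tf.keras.applications.InceptionV3,
    "resnet": tf.keras.applications.ResNet50,
    "efficientnet": tf.keras.applications.EfficientNetB0,
}

def build_har_model(name, num_classes=9, input_shape=(224, 224, 3)):
    base = BACKBONES[name](include_top=False, weights="imagenet",
                           input_shape=input_shape, pooling="avg")
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(base.output)
    model = tf.keras.Model(base.input, outputs)
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_har_model("resnet")  # the best-performing backbone in Section 4
```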

The most common image classification model is the VGG model, introduced by the Visual Geometry Group at the University of Oxford [14]. The VGG model consists of 13 convolution layers, five pooling layers, and three dense layers. The VGG model is sequential in nature and uses many filters one after another. The architecture uses a stack of convolutional layers, with different depths in different variants, followed by three fully-connected (FC) layers. The first two FC layers have 4,096 channels each, and the third FC layer performs the 1,000-way classification. The last layer is the softmax layer, which normalizes the classification vector. All the hidden layers in the VGG architecture use the rectified linear unit (ReLU) as the activation function. The ReLU activation function is computationally efficient and results in faster learning. The ReLU function also reduces the likelihood of vanishing gradient problems and improves classification performance. Figure 5(a) shows the architecture of the VGG network.

Next, our HIT machine used a deep learning model developed by Google [15]. GoogLeNet, or Inception, is a smaller network than the VGG model and uses an Inception module. The Inception module performs convolutions with different filter sizes on the input, performs max pooling, and concatenates the results for the next Inception module. The architecture uses 1 × 1 convolution operations, which reduce the number of parameters drastically. This architecture is designed to address computational expense, overfitting, and other deep learning issues. The Inception model takes advantage of multiple kernel filter sizes within the CNN and, rather than stacking them sequentially, orders them to operate on the same level. Figure 5(b) shows the Inception architecture used by our HIT machine. The architecture has nine Inception modules stacked linearly and is 22 layers deep (27 including the pooling layers). It uses global average pooling at the end of the last Inception module. Compared with VGG networks, Inception networks are more computationally efficient in terms of the number of parameters generated by the network and the computational cost incurred. For more details on the Inception model, refer to [15].
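A simplified sketch of one Inception module is shown below; the filter counts are illustrative rather than the exact GoogLeNet configuration.

```python
# Simplified Inception module: parallel 1x1, 3x3, 5x5 convolutions and pooling
# on the same level, concatenated along the channel axis (filter counts assumed).
import tensorflow as tf
from tensorflow.keras import layers

def inception_module(x, f1=64, f3=128, f5=32, fp=32):
    b1 = layers.Conv2D(f1, 1, padding="same", activation="relu")(x)
    b3 = layers.Conv2D(f1, 1, padding="same", activation="relu")(x)  # 1x1 bottleneck
    b3 = layers.Conv2D(f3, 3, padding="same", activation="relu")(b3)
    b5 = layers.Conv2D(f1, 1, padding="same", activation="relu")(x)
    b5 = layers.Conv2D(f5, 5, padding="same", activation="relu")(b5)
    bp = layers.MaxPooling2D(3, strides=1, padding="same")(x)
    bp = layers.Conv2D(fp, 1, padding="same", activation="relu")(bp)
    return layers.Concatenate()([b1, b3, b5, bp])
```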

Our HIT machine also analyzed the impact of the ResNet architecture on activity recognition. The main idea of the ResNet architecture is to avoid poor accuracy as the model grows deeper; the model is mainly designed to address the vanishing gradient problem. Figure 5(c) shows the ResNet architecture used by our HIT machine. The ResNet architecture is a 34-layer plain network inspired by VGG-19, to which shortcut connections are added. These shortcut connections convert the plain architecture into a residual network. The first two layers of the model are the same as those of the Inception model: a 7 × 7 convolution layer with 64 output channels and a stride of 2, followed by a 3 × 3 maximum pooling layer. The major difference in ResNet is the batch normalization layer added after each convolutional layer. The Inception model discussed previously uses four modules made up of Inception blocks, whereas the ResNet architecture uses four modules made up of residual blocks. Each module uses several residual blocks with the same number of output channels. The first module uses the same number of channels as the input. Starting from the first residual block of each subsequent module, the number of channels is doubled compared with the previous module, and the height and width are halved. Compared with the Inception architecture, the ResNet model is more straightforward, easier to modify, easier to optimize, and achieves higher accuracy as the depth of the network increases. For more details on the ResNet architecture and its implementation, refer to [16].
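The following sketch illustrates the residual block just described: two 3 × 3 convolutions with batch normalization after each, plus a shortcut connection that is projected with a 1 × 1 convolution when the spatial size is halved and the channel count doubled. The filter counts are illustrative.

```python
# Minimal residual block sketch (illustrative, not the paper's exact layers).
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters, stride=1):
    shortcut = x
    y = layers.Conv2D(filters, 3, strides=stride, padding="same")(x)
    y = layers.BatchNormalization()(y)   # batch norm after each convolution
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    if stride != 1 or x.shape[-1] != filters:  # halve H/W, double channels case
        shortcut = layers.Conv2D(filters, 1, strides=stride)(x)
        shortcut = layers.BatchNormalization()(shortcut)
    return layers.ReLU()(layers.Add()([y, shortcut]))
```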

Finally, our HIT machine used a model called EfficientNet from Google for activity recognition [17]. EfficientNet introduces a new scaling method called compound scaling. The ResNet model discussed before follows the conventional approach of scaling the dimensions arbitrarily and adding more layers. However, if the model scales the dimensions by a fixed amount simultaneously and does so uniformly, it achieves better performance. The user can decide the scaling coefficients. The EfficientNet architecture is a convolutional neural network architecture with a different scaling method: it uniformly scales all depth/width/resolution dimensions using a compound coefficient. Compared with conventional methods that arbitrarily scale these factors, the scaling method in the EfficientNet architecture uniformly scales network width, depth, and resolution with a set of fixed scaling coefficients. Figure 5(d) shows the EfficientNet architecture used by our HIT machine. The main building block of this architecture is the mobile inverted bottleneck convolution (MBConv), to which squeeze-and-excitation optimization is added. The MBConv layer is similar to the inverted residual blocks used in MobileNetV2 [97]. The MBConv creates a shortcut connection between the beginning and end of a convolutional block. The input activation maps are first expanded using 1 × 1 convolutions, increasing the depth of the feature maps. This is followed by 3 × 3 depth-wise convolutions and point-wise convolutions, which reduce the number of channels in the output feature map. The shortcut connections connect the narrow layers, while the wider layers sit between the skip connections. This structure decreases both the overall number of operations required and the model size. For more details on the EfficientNet architecture and its implementation, refer to [17].
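The sketch below illustrates an MBConv block as described above, with 1 × 1 expansion, a 3 × 3 depth-wise convolution, squeeze-and-excitation, and a 1 × 1 projection; the expansion and squeeze ratios are illustrative defaults, not EfficientNet’s exact settings for every stage.

```python
# Hedged MBConv sketch (ratios and filter sizes illustrative).
import tensorflow as tf
from tensorflow.keras import layers

def mbconv_block(x, out_channels, expand_ratio=6, se_ratio=0.25):
    in_channels = x.shape[-1]
    # 1x1 expansion increases the depth of the feature maps
    y = layers.Conv2D(in_channels * expand_ratio, 1, padding="same",
                      activation="swish")(x)
    y = layers.DepthwiseConv2D(3, padding="same", activation="swish")(y)
    # squeeze-and-excitation: channel-wise attention weights
    se = layers.GlobalAveragePooling2D()(y)
    se = layers.Dense(int(in_channels * se_ratio), activation="swish")(se)
    se = layers.Dense(in_channels * expand_ratio, activation="sigmoid")(se)
    y = layers.Multiply()([y, layers.Reshape((1, 1, -1))(se)])
    # 1x1 point-wise projection reduces the output channels
    y = layers.Conv2D(out_channels, 1, padding="same")(y)
    if in_channels == out_channels:
        y = layers.Add()([y, x])  # shortcut between the narrow layers
    return y
```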

4. Experiment Results and Analysis

We collected HAR datasets from different users to validate our proposed HIT machine-based HAR approach. There were 10 volunteers for data collection: five for the training dataset and five for the unseen dataset. The demographic information of the participants is given in Table 1.

We used Samsung Galaxy Note 8 and iPhone 11 Pro smartphone models for video recording. The smartphones were kept stationary during the initial stage of the experiment, and their positions were then moved based on the user’s motions. The users performed their activities within the 15 m experiment area. We also used the IMU and stretch sensors and recorded the sensor readings from the users’ activities during the experiments. Our conventional HAR approaches use the sensor readings for activity recognition, and we compared these HAR results with our HIT machine approach. Figure 6 shows the smartphones, the IMU and stretch sensors, and the experiment area involved in the HAR data collection. Table 2 summarizes our system configurations and the hyperparameters used for model training and testing.

We started the analysis of the HIT machine by implementing deep learning models, namely VGG, Inception, ResNet, and EfficientNet. We tested these models with our HAR dataset, and Figure 7 shows the classification results from each model. We used confusion matrices to analyze each model and summarize the classification performance. The color bars indicate the number of samples populated in a specific area: the more data samples, the lighter the color, and vice versa. The confusion matrices show that the ResNet architecture has the highest classification performance of all the models, achieving 98.53% model accuracy, 0.20 model loss, 98.56% precision, 98.53% recall, and a 98.54% F1 score. The VGG model reached 96.38% model accuracy with 0.09 model loss, 96.58% precision, 96.38% recall, and a 96.36% F1 score, as shown in Figure 7(a). The VGG model has a higher classification accuracy for the sitting, sit-up, standing, and walking activities. The model has the highest misclassification error for running; some of the running activity is misclassified as walking. Figure 7(b) shows the classification results from the Inception model. This model achieved 93.18% classification accuracy with 0.13 model loss, 93.18% precision and recall, and a 93.11% F1 score, which is worse than the performance obtained by the VGG model. Furthermore, Figure 7(c) shows the best classification results from our HIT machine, based on the ResNet architecture. The ResNet architecture showed the best model accuracy with the fewest classification errors. However, its model loss is higher than that of the VGG and Inception models, and it needs more computation time than them. This model maintains its classification accuracy for both basic and complex activities and is the best choice for HIT machine-based activity recognition. Figure 7(d) shows the results from our last deep learning model, EfficientNet. The EfficientNet reached 89.94% classification accuracy with 0.21 model loss, 90.19% precision, and 89.94% recall and F1 score, which is worse HAR performance than the VGG, Inception, and ResNet models. The higher level of classification error from EfficientNet shows that this model is unsuitable for our HIT machine-based activity recognition. Figures 8 and 9 show the models’ accuracy and loss plots, and Table 3 summarizes their performance.
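For readers who wish to reproduce Figure 7-style plots, a confusion matrix can be rendered as in the hedged sketch below; the stand-in predictions are synthetic placeholders, since our plotting scripts are not published.

```python
# Hedged sketch of a Figure 7-style confusion matrix (synthetic predictions).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

ACTIVITIES = ["sitting", "standing", "lying", "walking", "push up",
              "dancing", "sit-up", "running", "jumping"]

rng = np.random.default_rng(0)  # stand-in predictions for illustration only
y_true = rng.integers(0, 9, 500)
y_pred = np.where(rng.random(500) < 0.9, y_true, rng.integers(0, 9, 500))

ConfusionMatrixDisplay.from_predictions(y_true, y_pred,
                                        display_labels=ACTIVITIES)
plt.show()
```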

In Table 3, we used the accuracy, loss, precision, recall, and F1 score parameters for performance evaluation. The following equations from [98] define these parameters:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN}$$

$$F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

$$\text{Loss} = -\frac{1}{N}\sum_{i=1}^{N} y_i \log(\hat{y}_i)$$

where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives in a given experiment. In the loss function, $\hat{y}_i$ is the scalar value in the model output, $y_i$ is the corresponding target value, and the output size $N$ is the number of scalar values in the model output. From the results in Table 3, the ResNet architecture outperforms the other deep learning models with an average value of 98.53%. These results indicate that a system trained with the ResNet model is the best choice for activity recognition.
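These metrics can be computed directly from model predictions, for example with scikit-learn as in the following sketch; the weighted averaging choice is our assumption for the multi-class setting.

```python
# Hedged sketch of the Table 3 metrics computed with scikit-learn
# (averaging strategy assumed; the paper's own scripts are not published).
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, log_loss)

def evaluate(y_true, y_pred, y_prob):
    """y_true/y_pred: class indices; y_prob: per-class probabilities."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="weighted"),
        "recall": recall_score(y_true, y_pred, average="weighted"),
        "f1": f1_score(y_true, y_pred, average="weighted"),
        "loss": log_loss(y_true, y_prob),  # categorical cross-entropy
    }
```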

The training time results in Table 3 show that the ResNet-based HAR approach has a higher training time (600 s) than the other models. This is due to ResNet’s deep architecture, which takes more time to train. However, the activity recognition results from ResNet compensate for the training time when the overall system performance is considered (a 6.44% classification improvement over the EfficientNet model-based HAR system). The VGG model-based HAR system achieved the lowest training time (240 s) of all the models and reached good classification results for activity recognition. The Inception model-based HAR system differs from the EfficientNet model by 60 s of training time: the EfficientNet has a lower training time (300 s) than the Inception model-based HAR system (360 s). However, the EfficientNet-based HAR approach shows worse classification results than the other models.

To further validate our HIT machine’s performance, we tested the pre-trained deep learning models with unseen HAR datasets. We collected another set of HAR datasets and tested them with our pre-trained models. Table 4 summarizes the results on unseen datasets from the pre-trained models. The results in Table 4 show that the ResNet architecture achieved 72.13% classification accuracy with 72.25% precision, 72.92% recall, and a 72.95% F1 score. These results outperform the other pre-trained models. However, the computational complexity of this architecture makes it more challenging in practice for real-time HAR applications. The Inception model reached 65.17% classification accuracy with 65.21% precision, 65.08% recall, and a 65.59% F1 score; the Inception-based pre-trained model gives better results than the VGG and EfficientNet pre-trained models. The VGG-based pre-trained model shows 61.85% classification accuracy with 61.73% precision, 61.48% recall, and a 61.47% F1 score. The EfficientNet pre-trained model-based HAR system shows 57.42% classification accuracy with 57.48% precision, 57.43% recall, and a 57.72% F1 score. These are the worst classification results among the pre-trained models, and the approach is unsuitable for image-based HAR systems.

Next, we validated our HIT machine results against sensor-based HAR approaches and image-based HAR without the HIT machine. Figure 10 shows the classification results from our HIT machine, HAR without the HIT machine, and the sensor-based techniques. This analysis uses a 2D CNN model for activity recognition; the CNN model is computationally lighter than the other deep learning models and easily fits the IMU and stretch sensor datasets. Figure 10(a) shows the classification results from the IMU sensor-based HAR approach. The results show that the IMU sensor approach reached 90.71% classification accuracy with 0.27 model loss, 90.47% precision, 90.71% recall, and a 90.00% F1 score. The activities running, sitting, sit-up, standing, and walking have higher classification errors due to the similarities in their IMU sensor data. The model fails to separate these activities, increasing the classification errors in the HAR system. When the system uses a stretch sensor instead of an IMU sensor, the classification performance improves by about 3%. The stretch sensor-based HAR system achieved 93.80% classification accuracy with 0.27 model loss, 94.16% precision, 93.80% recall, and a 93.20% F1 score. Figure 10(b) shows the classification results from the stretch sensor-based HAR approach. The stretch sensor data are more stable than the IMU sensor data and give more accurate HAR results, although the sitting and walking activities have higher classification errors than in the IMU sensor-based approach. The stretch sensor-based HAR approach is reasonable if the system cost is not a primary concern, but the prohibitive cost of the stretch sensor makes the system challenging for practical healthcare applications. Next, we analyzed a HAR approach that uses image data without a HIT machine. Figure 10(c) shows the results from HAR without the HIT machine. The HAR system without the HIT machine reached 90.98% classification accuracy with 0.20 model loss, 91.24% precision, 90.98% recall, and a 90.90% F1 score. These results indicate the significance of the HIT machine: compared with the results in Figure 10(d), the system without a HIT machine has a higher classification error and shows the worst performance for both basic and complex activities. Figure 10(d) shows the classification performance of the HIT machine, which has the best performance of all the HAR approaches. The system achieved a 6.01% accuracy improvement over the IMU sensor-based approach, a 2.4% accuracy improvement over the stretch sensor-based approach, and a 5.3% accuracy improvement over the HAR approach without the HIT machine. Our proposed HIT machine-based HAR system shows 96.28% classification accuracy with 0.09 model loss, 96.26% precision, 96.28% recall, and a 96.27% F1 score. Table 5 summarizes the performance of each approach in terms of accuracy, loss, precision, recall, and F1 score. The Table 5 results show that the HIT machine achieves higher classification results than the sensor-based and without-HIT-machine HAR approaches, indicating the impact of HIT machine-based activity recognition for complex activities.

The training time results in Table 5 indicate that the stretch sensor-based HAR system has the shortest training time (120 s) of the HAR systems. This is due to the small number of data samples in the stretch sensor dataset. The IMU sensor-based HAR approach has a 300 s training time, 180 s longer than the stretch sensor-based approach, and its classification accuracy is 3.09% lower than the stretch sensor-based approach. The proposed HIT machine-based HAR approach requires 340 s of training time, which is lower than the HAR approach without the HIT machine (480 s). The training time results from our proposed HIT machine indicate that the approach reduces training time by 140 s compared with HAR without the HIT machine.

From the experiments and result analysis, it can be seen that the HIT machine-based HAR approach plays a significant role in activity recognition. The proposed HAR system addresses a primary challenge of vision-based HAR systems: processing high-quality images. We used image cropping, resizing, and data cleaning so that the system can process high-quality images without compromising the classification results. Our system takes advantage of the mask R-CNN algorithm, which is computationally lighter than other vision algorithms. The proposed method also mitigates the camera viewpoint and background clutter issues by exploiting the smartphone camera’s wide-angle feature. The classification results from the HIT machine show that the proposed HAR approach is a valid method for healthcare applications, including abnormal activity detection, elderly care in homes, and disabled assistance. Extended versions of the HIT machine would be helpful in other applications, including intelligent environments, indoor navigation [99], security and surveillance, and people monitoring [100].

5. Conclusion

This paper proposed a HIT machine-based HAR system for healthcare applications. The proposed HIT machine approach effectively utilizes the advantages of the mask R-CNN for human body detection and enhances the performance of HAR. The classification results from our experiments indicate that the proposed HIT machine achieves better classification results than conventional sensor-based HAR approaches. Traditional sensor-based HAR systems are not free from sensor errors and show very poor classification results for complex activities. The proposed HIT machine-based HAR system is suitable for basic and complex user movements and maintains its classification accuracy across all user motions. Our HAR classification results and analysis show the influence of the HIT machine on activity recognition. The proposed HIT machine-based HAR system is a suitable healthcare option when HAR systems use a camera as their input device. We validated our proposed HIT machine-based HAR system for human activity recognition through extensive experiments and analysis. To improve the classification performance, in future work we intend to use a sensor fusion technique that combines image and sensor data for activity recognition. Furthermore, we will consider the most popular public datasets (such as the UCI Human Activity Recognition Using Smartphones dataset) for future research and compare our HAR datasets’ performance with these public datasets.

Data Availability

The data used to support the findings of this study have not been made available because of the privacy of the research participants.

Conflicts of Interest

The authors declare that they have no conflicts of interest or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (2021R1A6A1A03043144).