Abstract

Human activity recognition (HAR) is a fascinating yet challenging task. In general, the accuracy of a HAR system relies on extracting the best features from the input frames. Activity frames often suffer from severe noise that most existing edge operators cannot handle. In this paper, we design an adaptive, edge-detection-based feature extraction method for HAR systems. The proposed method computes edge directions and applies non-maximum suppression. Its benefits lie in its simplicity, resting on modest procedures, and in its possible extension to other types of features. It is usually practical to extract additional low-level information in the form of features when detecting shapes; to obtain the appropriate information, a more sophisticated shape detection procedure is then either applied or discarded. Essentially, the method maximizes the product of the signal-to-noise ratio (SNR) and localization while favouring a single response per edge. Because some edges in the processed frames appear as step functions, the proposed approach can outperform other operators. The appropriate information is extracted to form a feature vector, which is then fed to a classifier for activity recognition. We assess the performance of the proposed edge-based feature extraction method on a depth dataset containing thirteen different kinds of actions in a comprehensive experimental setup.

1. Introduction

The extensive applications of human activity recognition (HAR) in real-time surveillance, athletics, human-to-human communication, and healthcare have shown significant effects in raising the quality of human life. In telemedicine and healthcare, remote analysis of patients' activities and proper treatment can drive innovation in these domains. Such development supports physicians in their decision-making by letting them remotely observe the activities of patients, particularly stroke patients [1]. That work describes a case study for monitoring critical situations of stroke patients in telemedicine. The condition of stroke patients can be characterized through activity recognition, which can help experts provide daily recommendations to patients. Activities such as jogging and walking can be recognized for stroke patients, and recommendations can then be provided remotely for better treatment. Similarly, a therapist may use telecare technology to treat a patient with post-traumatic stress disorder by remotely observing the patient's exercises [2]. Methods have also been developed to monitor patients' heart condition through telecare [3], a major application of action analysis. These innovations enable experts and doctors to monitor and manage patients' diseases efficiently.

Generally, HAR systems recognize the actions performed by a human from data collected through various resources. In HAR, 2D vision-based approaches are affected by illumination, occlusion, and shadow in real-time domains [4]. Wearable sensor-based human activity recognition offers significant accuracy and real-time performance [5]. However, collecting human actions through accelerometers and gyroscopes reduces the comfort of the human body and undermines the spontaneity of human-computer interaction [6]. Currently, telemedicine and healthcare employ ambient sensing along with video technologies, which raises privacy issues: it can lead to situations where people do not realize that their private data has been propagated and consequently become exposed to risk [7]. A Kinect depth camera captures only depth data and does not expose the identity or other sensitive information of the human in healthcare domains. This characteristic makes the depth camera a better choice than an RGB camera, and we therefore select the depth camera instead of an RGB camera for the proposed methodology.

The recognition of human activities can be divided into two categories: frame-based recognition and sequence-based recognition. In frame-based recognition, only the current frame is used, with or without a reference frame, to identify human activities from the incoming video frames, while in sequence-based recognition, the regular motion of feature points is considered between the current frame and the initial frame. Frame-based recognition may therefore not be able to accurately recognize human activities, so we focus on sequence-based recognition in this work. Generally, HAR systems have four stages: segmentation, feature extraction, feature selection, and recognition. A large amount of work has been done on segmentation and recognition; however, far fewer approaches have been proposed for feature extraction and selection. State-of-the-art HAR methods are discussed in [8-10], where the authors proposed a new feature extraction method based on the Haar wavelet transform to create the feature vector. They then used k-NN to recognize the activities and reported high accuracy. However, the Haar wavelet transform has a technical limitation: it is not continuous and hence not differentiable [11]. Moreover, if the data is large, the prediction step of k-NN can be slow, which is one major limitation of k-NN. Similarly, activity recognition frameworks were developed in [12-14] based on temporal matching to build temporal features, which capture the differences between motion sequences over various time periods. However, most of these systems are too complex and inefficient for comprehensive datasets, which reduces their classification rates.

Hence, in this paper, we design an adaptive feature extraction method for HAR systems. The proposed method computes edge directions and applies non-maximum suppression. Its benefits lie in its simplicity, resting on modest procedures, and in its possible extension to other types of features. It is usually practical to extract additional low-level information in the form of features when detecting shapes; to obtain the appropriate information, a more sophisticated shape detection procedure is then either applied or discarded. As illustrated, activity frames are very noisy and need preprocessing before analysis; we therefore use histogram equalization for this step. Moreover, the frames contain features (referred to here as sludge) that various edge operators may pick up when extracting edges if noise is not considered. The feature vectors of the proposed methodology are generated using hysteresis thresholding, with the thresholds selected manually for best performance. Basic operators respond strongly to noise, making it hard to choose a threshold that reveals a major portion of the sludge boundary. Essentially, the method maximizes the product of the signal-to-noise ratio (SNR) and localization while favouring a single response per edge. Because some edges in the processed frames appear as step functions, the proposed approach can outperform other operators. The appropriate information is extracted to form a feature vector, which is then fed to the classifier for activity recognition. We assess the performance of the proposed edge-based feature extraction method on a depth dataset containing thirteen different kinds of actions in a comprehensive experimental setup.
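The preprocessing and thresholding steps described above can be illustrated with a short sketch (in Python rather than the MATLAB used in our experiments; the threshold values are placeholders, since the thresholds are selected manually per dataset):

```python
import numpy as np

def histogram_equalize(frame):
    """Spread the grey-level histogram of an 8-bit frame across [0, 255]."""
    hist = np.bincount(frame.ravel(), minlength=256)
    cdf = np.cumsum(hist).astype(np.float64)
    cdf = (cdf - cdf.min()) / (cdf.max() - cdf.min())  # normalise to [0, 1]
    return (cdf[frame] * 255).astype(np.uint8)

def hysteresis_threshold(magnitude, low, high):
    """Keep weak edge pixels (>= low) only if connected to a strong pixel (>= high)."""
    strong = magnitude >= high
    weak = magnitude >= low
    result = strong.copy()
    changed = True
    while changed:  # grow strong edges into 4-connected weak pixels
        grown = np.zeros_like(result)
        grown[1:, :] |= result[:-1, :]
        grown[:-1, :] |= result[1:, :]
        grown[:, 1:] |= result[:, :-1]
        grown[:, :-1] |= result[:, 1:]
        new = result | (grown & weak)
        changed = bool((new != result).any())
        result = new
    return result
```

Isolated weak responses (the noisy "sludge") are discarded, while weak pixels attached to a confident edge survive, which is exactly why hysteresis is preferred over a single threshold here.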

The remainder of the paper is structured as follows. Related work and its shortcomings are reviewed in Section 2. The proposed method is explained in Section 3, and the results and discussion are presented in Section 4. Finally, Section 5 concludes the paper.

2. Related Work

Recently, activity recognition has come to mean understanding human movements from a sequence of frame observations under varying environmental conditions. Newly available commodity depth cameras open up innovative possibilities for dealing with this problem but also pose some unique challenges. A state-of-the-art feature extraction method was designed in [15] for condensed systems, based on spatial and temporal information. The authors then used k-NN for classification and reported high accuracy. However, this system is complex and inefficient for comprehensive datasets, which reduces its classification rates, a common limitation of spatial-temporal features. Moreover, if the data is large, the prediction step of k-NN can be slow, which is one major limitation of k-NN [16]. Similarly, the authors of [17] developed a psychology-inspired two-stream gated recurrent unit approach for activity classification based on human body joints. Their approach achieved 89.97% classification accuracy. However, it may not be applicable in healthcare domains because of this comparatively low accuracy.

A depth map-based HAR system was designed in [18, 19] that used a convolutional autoencoder neural network to learn fixed features and summarize the content of individual depth maps. However, convolutional neural networks are not well suited to naturalistic domains; they are computationally expensive and require special hardware for training [20]. Similarly, an ensemble method was developed in [21, 22] for HAR systems, based on a graph network over the human skeleton. The system used information fusion, deep learning, and temporal information to extract the best features. However, with data fusion it is hard to place applications and data at the right level of the system and to reuse them [23], and it is also unsuitable from a human perspective [24]. Moreover, temporal information is inefficient for comprehensive datasets, which reduces the system's classification rates.

On the other hand, an integrated descriptor-based method was designed in [25] for HAR. In this system, the concept of multilevel fusion over depth frames was used for feature extraction to devise a comprehensive learning system. However, multilevel fusion-based feature extraction has some common issues, such as feature mismatch, limited robustness to camera sensors, failure cases, and susceptibility to unpredictable noise or interference, because reduced discriminative sensitivity greatly decreases the effectiveness of multilevel fusion-based methods [26]. Furthermore, finding optimal features and feature extraction methods requires extensive domain knowledge, which is time-consuming [27]. In the approach of [28], gesture and motion-history frames are computed, which help extract gradient features. However, one major limitation of gradient-based feature extraction is the existence of numerous local optima, resulting in solutions where the global optimum cannot easily be guaranteed [29].

An efficient approach was proposed in [30] for HAR with three-dimensional skeleton information, employing a straightforward deep learning model for activity classification. However, a major problem with skeleton-based feature extraction is that sparse skeleton data alone may not be enough to fully classify human activities [31]. Similarly, an improved bag-of-visual-words activity classification method was designed in [32], where the authors employed a support vector machine for classification. However, in this model, a huge number of key points are located during feature detection, which degrades performance and makes the model computationally expensive [33]. Also, the support vector machine is a vector-based classifier and cannot directly classify sequence-based activities. A state-of-the-art adaptive approach based on three-dimensional autocorrelation features was presented in [34]. The authors described sequences of depth motion maps to obtain temporal motion data that can differentiate similar activities. However, this method is affected by affine transformations, which requires extra processing to resolve more complex activity recognition [35].

Therefore, the proposed method computes edge directions and applies non-maximum suppression. Its benefits lie in its simplicity, resting on modest procedures, and in its possible extension to other types of features. It is usually practical to extract additional low-level information in the form of features when detecting shapes; to obtain the appropriate information, a more sophisticated shape detection procedure is then either applied or discarded. Essentially, the method maximizes the product of the signal-to-noise ratio (SNR) and localization while favouring a single response per edge. Because some edges in the processed frames appear as step functions, the proposed approach can outperform other operators. The appropriate information is extracted to form a feature vector, which is then fed to the classifier for activity recognition.

3. Proposed Edge Detection Algorithm

Taylor series analysis shows that differencing adjacent pixels provides an estimate of the first-order derivative at a pixel. If the difference is taken between points separated by Δx, then, by Taylor expansion, P(x + Δx) is

P(x + Δx) = P(x) + Δx · P′(x) + (Δx²/2!) · P″(x) + O(Δx³). (1)

By rearrangement, the term P′(x) is given as

P′(x) = (P(x + Δx) − P(x))/Δx + O(Δx). (2)

This shows that differencing adjacent pixels estimates the first-order derivative with an error O(Δx), which depends on the size of Δx and the complexity of the edge. If Δx is large, the error can be large. In practice, the sampling of image points and the reduced high-frequency content make this assumption reasonable; it amounts to computing the first-order difference at two adjacent pixels, i.e., a horizontal difference

Exx(x, y) = |P(x, y) − P(x + 1, y)|, (3)

which, incorporating the vertical direction as well, is equivalent to detecting edges by

E(x, y) = |2P(x, y) − P(x + 1, y) − P(x, y + 1)|. (4)

Again, to analyze the Taylor series, we expand P(x − Δx) as

P(x − Δx) = P(x) − Δx · P′(x) + (Δx²/2!) · P″(x) − O(Δx³). (5)

By subtracting equation (5) from equation (1) and rearranging, we obtain

P′(x) = (P(x + Δx) − P(x − Δx))/(2Δx) + O(Δx²). (6)

Equation (6) shows that the first-order derivative is now estimated from the difference between pixels separated by a single point, with an error O(Δx²). For Δx < 1 this is clearly smaller than the error of adjacent-pixel differencing in equation (2), which can be exploited for noise or error reduction as

Exx(x, y) = |P(x + 1, y) − P(x − 1, y)|. (7)
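The error behaviour of the two estimates above — O(Δx) for adjacent pixels in equation (2) versus O(Δx²) for pixels one point apart in equation (6) — can be checked numerically; this is an illustrative sketch, not part of the proposed system:

```python
import numpy as np

def forward_difference(row):
    """Derivative estimate from adjacent pixels, equation (2): error O(dx)."""
    return row[1:] - row[:-1]

def central_difference(row):
    """Derivative estimate from pixels one point apart, equation (6): error O(dx^2)."""
    return (row[2:] - row[:-2]) / 2.0

# On a quadratic signal P(x) = x^2 (true derivative 2x, sampled at dx = 1),
# the forward difference is off by a constant 1, while the central
# difference is exact, since the O(dx^2) term vanishes for quadratics.
x = np.arange(5.0)
row = x ** 2
```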

M is the magnitude of the edge vector in the given direction. During implementation, the template that yields the highest value is stored as the edge value at that pixel; that is, the edge value at a point (x, y) is the maximum of the two values obtained by convolving the two image templates with the image at that point.

Another way of combining the two template outputs, instead of taking the maximum, is simply to sum them, associating the edges along the horizontal and vertical directions. Among the many variants of edge operators, a common mechanism is to take the two templates as building blocks of an edge vector along the horizontal and vertical axes, respectively. Edge detection is essentially a differencing mechanism: it perceives variations and must respond to step-like changes in image intensity while suppressing noise. It is therefore practical to integrate averaging into the edge detection process. Hence, we may spread the horizontal (Mi) and vertical (Mj) templates over three rows and columns, respectively, which yields two kinds of results: the edge component along each axis and the edge magnitude M; θ is the angle of the edge vector, defined as follows:

M(x, y) = √(Mi(x, y)² + Mj(x, y)²),  θ(x, y) = tan⁻¹(Mj(x, y)/Mi(x, y)). (8)
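A minimal sketch of spreading the differencing templates over three rows and columns and then computing the magnitude and angle of equation (8) might look as follows (simple Prewitt-style averaging is assumed here for illustration; the function name and direct 3 × 3 correlation loop are our own):

```python
import numpy as np

def prewitt_edges(frame):
    """Horizontal/vertical edge components averaged over three rows/columns,
    plus edge magnitude M and direction theta as in equation (8)."""
    f = frame.astype(np.float64)
    kx = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])          # horizontal difference, vertical average
    ky = kx.T                            # vertical difference, horizontal average
    h, w = f.shape
    Mi = np.zeros((h - 2, w - 2))
    Mj = np.zeros((h - 2, w - 2))
    for i in range(h - 2):               # direct 3x3 correlation, no extra libraries
        for j in range(w - 2):
            patch = f[i:i + 3, j:j + 3]
            Mi[i, j] = (patch * kx).sum()
            Mj[i, j] = (patch * ky).sum()
    magnitude = np.hypot(Mi, Mj)         # M(x, y)
    theta = np.arctan2(Mj, Mi)           # quadrant-aware edge vector angle
    return magnitude, theta
```

Using arctan2 rather than a plain arctan keeps the full quadrant information of the edge vector, which matters later when directions are quantized into feature vectors.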

Mi and Mj can be used to find the appropriate quadrant of the edge direction. In the proposed methodology, we also utilize the Sobel operator, one of the most well-known edge detection operators, which employs two windows to find edges in the form of vectors. The proposed method combined with the Sobel operator showed better performance than other concurrent operators. The proposed operator also considers optimal averaging and differencing procedures. Gaussian averaging has previously been shown to provide optimal averaging, and the binomial theorem provides a series of integer coefficients that approximate the normal distribution. As described before, we use two windows, which give two sets of coefficients in the form of triangles, as shown in Figure 1.

Figure 1(a) gives the integer coefficients of an optimal discrete smoothing operator, which is essentially a Gaussian filter with integer coefficients. In Figure 1, each row gives the coefficients for a window of increasing size; the smoothing coefficients inside the Sobel operator correspond to the 3 × 3 case. Figure 1(b) shows the Pascal triangle coefficients for differencing, which can be obtained by subtracting the templates derived from adjacent expansions of a smaller mask. We therefore need a function that provides the Pascal triangle coefficient for window size n and position k, as shown in the following:

Pascal(k, n) = n!/((n − k)! k!) if 0 ≤ k ≤ n, and 0 otherwise. (9)
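The Pascal-triangle construction of the smoothing and differencing windows from equation (9) can be sketched directly; for size 3 it reproduces the familiar Sobel template (the function names are ours):

```python
from math import comb

def pascal(k, n):
    """Pascal triangle coefficient of equation (9): n!/((n-k)! k!) for 0 <= k <= n, else 0."""
    return comb(n, k) if 0 <= k <= n else 0

def smooth_coeffs(size):
    """Integer smoothing window: a binomial approximation to the Gaussian (Figure 1(a))."""
    return [pascal(k, size - 1) for k in range(size)]

def diff_coeffs(size):
    """Differencing window: subtraction of adjacent smaller Pascal rows (Figure 1(b))."""
    return [pascal(k, size - 2) - pascal(k - 1, size - 2) for k in range(size)]

def sobel_template(size):
    """Outer product of smoothing and differencing rows gives the horizontal template."""
    s, d = smooth_coeffs(size), diff_coeffs(size)
    return [[si * dj for dj in d] for si in s]
```

For example, `sobel_template(3)` yields the rows [1, 0, -1], [2, 0, -2], [1, 0, -1], and larger window sizes follow from the same two triangles without any new machinery.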

There are four possibilities for the edge direction measurement delivered by the Sobel operator, as described in Figure 2.

In Figure 2, reversing the Mi template does not introduce discontinuities at the corners; the edge magnitude obtained when the Sobel operator is applied to the square is not shown, but it is comparable to that produced by other operators. If we change the templates of the Sobel filter, the edge direction measurement is reorganized and may follow the normal to the edge itself. When edges are to be determined, this reorganization can help an algorithm locate its target: an algorithm that locates shapes should use the edge direction for accurate guidance. This procedure can improve the algorithm's performance, but it must precisely map the corresponding image data. Once all edges and their directions are detected, the entire information is stored as a feature vector, which is then used with a hidden Markov model for activity recognition.
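The exact layout of the stored feature vector is not fixed above; one choice consistent with the description — recording the directions of strong edges — is a magnitude-weighted direction histogram. The following is a hypothetical sketch (the bin count and threshold are illustrative assumptions, not values from our experiments):

```python
import numpy as np

def direction_histogram(magnitude, theta, bins=8, threshold=10.0):
    """Quantize edge directions of strong edge pixels into a fixed-length
    feature vector; magnitudes act as votes (hypothetical layout)."""
    mask = magnitude >= threshold            # keep strong edges only
    angles = np.mod(theta[mask], 2 * np.pi)  # map angles into [0, 2*pi)
    idx = np.minimum((angles / (2 * np.pi) * bins).astype(int), bins - 1)
    feat = np.zeros(bins)
    np.add.at(feat, idx, magnitude[mask])    # magnitude-weighted votes
    total = feat.sum()
    return feat / total if total > 0 else feat
```

A fixed-length, normalized vector of this kind is what a hidden Markov model expects as its per-frame observation.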

4. Algorithm Evaluation

We evaluated the proposed methodology according to the following protocol.

4.1. Arrangement for Experiments

We performed the following experiments to judge the performance of the proposed methodology:
(i) The first experiment measures the accuracy of the proposed methodology on the defined dataset. A leave-one-subject-out cross-validation scheme is used for all experiments, meaning that the data from each subject is used at least once for training and testing in order to maintain robustness.
(ii) In the second experiment, comprehensive experiments are run without the proposed methodology: we perform multiple experiments with recent feature extraction methods but do not use the proposed feature extraction method.
(iii) In the final experiment, the performance of the proposed methodology is compared with state-of-the-art methods.
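The leave-one-subject-out protocol of experiment (i) can be sketched as follows (the tuple layout for samples is an assumption for illustration; the classifier itself is omitted):

```python
def leave_one_subject_out(samples):
    """Yield (held_out_subject, train, test) splits where each subject's data
    is held out exactly once. `samples` is a list of
    (subject_id, feature_vector, label) tuples."""
    subjects = sorted({s for s, _, _ in samples})
    for held_out in subjects:
        train = [x for x in samples if x[0] != held_out]
        test = [x for x in samples if x[0] == held_out]
        yield held_out, train, test
```

With 70 subjects this produces 70 splits, so every subject contributes to both training and testing across the protocol, which is what makes the reported accuracies robust to subject identity.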

4.2. Depth Images Dataset

The dataset comprises 670 video sequences collected with a Kinect depth camera. A total of 70 subjects (male and female university students) performed thirteen activities: bending, jacking, place-jumping, running, side movement, skipping, walking, one-hand waving, two-hand waving, jumping, clapping, boxing, and sitting up and down. To make the dataset more realistic, we also recorded some videos of stroke patients in a real healthcare setting. Since the activity frames had various sizes, we resized all frames to 100 × 100 for normalization. The dataset was recorded over a 6-month period (February 2017 to July 2017).

4.3. Experimental Results and Discussions

In the first experiment, we measured the accuracy of the proposed methodology on the depth dataset; the experiments were performed offline in MATLAB in a lab setting. The full results are presented in Table 1.

The proposed methodology showed significant accuracy, as presented in Table 1. This performance is due to the calculation of edge directions with non-maximum suppression. Also, since some edges in the processed frames appear as step functions, the proposed approach can outperform other operators.

In the second experiment, accuracies are reported for various recent machine learning feature extraction methods, without using the proposed methodology. For this group of experiments, we employed several well-known feature extraction techniques: autoencoder, histogram of oriented gradients, contrast features, ellipse features, Fourier features, Gabor features, Haralick texture features, geometric features, local binary pattern features, and basic intensity features. The full results for this group of experiments on the depth dataset are reported in Tables 2-11, respectively.

As demonstrated in Tables 2-11, the activity recognition system did not achieve its best classification rates with these recent feature extraction techniques, whereas it achieved significant accuracy with the proposed feature extraction technique. This is because the proposed method computes edge directions and applies non-maximum suppression; its benefits lie in its simplicity, resting on modest procedures, and in its possible extension to other types of features.

In the last experiment, the classification accuracy of the proposed methodology is compared with state-of-the-art HAR systems. Some of the existing systems were implemented for these experiments; for some we borrowed their simulations, and for others we took the results from their papers. All systems were evaluated on the depth dataset with the exact settings presented in their respective works. The full comparison results are given in Table 12.

As demonstrated in Table 12, the proposed methodology achieved the highest accuracy on the depth dataset compared with other state-of-the-art systems. This is because, in the proposed method, some edges in the processed frames appear as step functions, for which the proposed approach can outperform other operators; the appropriate information is also extracted into a feature vector, which is fed to the classifier to obtain the best classification accuracy.

5. Conclusion

Human activity recognition is the task of evaluating a person's physical activity through ambient sensing. Currently, telemedicine and healthcare domains employ such ambient sensing along with video technologies, which raises privacy issues; this can lead to situations where people do not realize that their private data has been propagated and consequently become exposed to risk. Therefore, in this paper, we designed an adaptive feature extraction method for HAR systems that uses a Kinect depth camera to resolve the privacy issue. The proposed method computes edge directions and applies non-maximum suppression. Its benefits lie in its simplicity, resting on modest procedures, and in its possible extension to other types of features. It is usually practical to extract additional low-level information in the form of features when detecting shapes; to obtain the appropriate information, a more sophisticated shape detection procedure is then either applied or discarded. As illustrated, activity frames are very noisy and need preprocessing before analysis; we therefore used histogram equalization for this step. Moreover, the frames contain features (referred to here as sludge) that various edge operators may pick up when extracting edges if noise is not considered. The feature vectors of the proposed methodology were generated using hysteresis thresholding, with the thresholds selected manually for best performance.
Basic operators respond strongly to noise, making it hard to choose a threshold that reveals a major portion of the sludge boundary. Essentially, the method maximizes the product of the signal-to-noise ratio (SNR) and localization while favouring a single response per edge. Because some edges in the processed frames appear as step functions, the proposed approach can outperform other operators. The appropriate information is extracted to form a feature vector, which is then fed to the classifier for activity recognition. We assessed the performance of the proposed edge-based feature extraction method on a depth dataset containing thirteen different kinds of actions in a comprehensive experimental setup. The proposed methodology achieved the highest classification accuracy compared with existing state-of-the-art HAR systems.

In the future, we will try to extend the proposed methodology in the telemedicine and healthcare domains in order to sustain the same accuracy and help physicians make better recommendations. Our alternative goal is to deploy the proposed methodology on smartphones; for that purpose, we will try to design a lightweight classifier coupled with the proposed approach for implementation on smartphones in real healthcare domains.

Data Availability

The data used for this study and its simulation are available from the authors on request.

Conflicts of Interest

The authors declare that they have no conflicts of interest to report regarding the present study.

Acknowledgments

This work was funded by the Deanship of Scientific Research at Jouf University under Grant no. DSR–2021–02–0346.