Abstract

This paper presents a new method for recognizing human activities based on weighted classification of features extracted from the human body. Toward this end, the proposed descriptor relies on new weight-based features taken from images or video. Human pose plays an important role in the extracted features, which are then fed to the classifier as weighted inputs. Machine learning is applied in two steps, training and testing, on images from a standard dataset that can be used to benchmark the system. Unlike previous methods, which mainly rely on the size or length of shapes to represent cues when machine learning is used to recognize human activities, the proposed method derives its features from appropriate segments of the human body, and accurate experimental results demonstrate its worth. Twelve activities from publicly available datasets are used in a challenging comparison to demonstrate our method. The results show that we achieved 87.3% on the training set and 94% on the testing set in terms of precision.

1. Introduction

In the era of Information Technology (IT), recognition of human activities plays a paramount role in many computing applications and studies. Recently, machine learning has aimed to take features of the human body, analyze them, and extract the informative data that reveal the actual activity. Each human activity has certain features that indicate its type. Automatic learning therefore depends mainly on the weights of these features in order to classify them and thus correctly detect the activity.

Human Activity Recognition (HAR) from video and images has been widely studied, largely because of its flexibility [1]. HAR in still images, however, has received less attention because of the missing prior information during processing and the nature of the image itself: a single image carries fixed and rigid information that is difficult to work with [2, 3]. The information needed by any HAR system is extracted from images or video, and preprocessing must be done before implementation to obtain prior knowledge about the features extracted in advance. Ground truth for HAR covers both video and images, and the more challenging case is the image; since a video consists of a sequence of images, both pre- and post-frame data are available for extraction. The main goal of a HAR system that uses images or video is to develop an automatic interpretation of the gathered data that precisely determines a person's behavior. HAR is useful in a wide range of applications, especially human-computer interfaces when an appropriate implementation is available [4]. HAR schemes are mostly used in automated surveillance and security systems.

Every HAR system needs a classifier to classify the features for activity detection. SVM is one of the best classifiers in this regard because it is easy to use and reliable, and it can be improved to suit the proposed method.

Systems that deal with pattern recognition must include several inevitable stages: a preparation stage (noise reduction and enhancement), segmentation (or partitioning), feature extraction, and classification [5]. Noise removal is necessary for images, and many techniques suggested in the literature, such as fuzzy systems, are worth addressing [6]. Segmentation is useful both for frames of a video and for still images. Machine learning depends on the features extracted from the human body, and, to extract these data, the human figure must first be segmented, because segmentation gives the system a clearer view that the learning algorithm can understand [7]. Machine learning should extract the important features that are fed to the classifier, which become the data the machine estimates during learning; further derived subinformation is then obtained [8]. Selecting the best features is in fact the main issue that determines whether a HAR system is useful or not [9, 10].

Recently, many classifiers have been suggested and developed for accurate real-time applications, achieving good results [11]. The majority of these classifiers were developed mathematically, and the best known is the Support Vector Machine (SVM), which is also developed in depth in this paper. Several other classifiers, such as K-Nearest Neighbor (KNN) and Linear Discriminant Analysis (LDA), also perform good classification of human activities [12, 13]. Real-time applications need a fast response time to be practical, yet few classifiers can use large feature sets in their processing and still achieve accurate results. In general, a system that deals with pattern recognition or classification consists of three main steps: preprocessing, feature extraction, and finally classification by a suitable classifier [10, 14].

The purpose of classification is to recognize human activity behavior; in this regard, understanding human movement is necessary for understanding and executing all activities. A HAR system should analyze each movement together with its response. Sometimes the shape of the human body yields features that are hard to interpret because two activities, such as walking and running, are similar, with the body taking nearly the same position and shape. Figure 1 visualizes the activities by action, such as running, walking, and others; some parts of the human body are easy to detect, while other parts are difficult to detect for certain activities. Some motion activities, such as smiling, crying, or sadness, are difficult to detect in the spatial domain; thus a frequency-domain transform (DT or CWT filters) is needed for feature extraction [15, 16].

Video and static images are the two main domains addressed in this paper; the more challenging one is the still image, so the proposed method contributes to both video and images. Because recognition of an activity from a single or still image largely lacks prior knowledge, obtaining good results is harder. For this reason, existing methods have been improved and extended in this regard [17]. Feature extraction for human activity classification can be done in hardware or software: hardware approaches use wearable sensors to detect activities [18, 19], while software approaches extract features with programming tools. Figure 2 illustrates the two different domains mentioned above.

Recently, video surveillance cameras have been widely used outdoors and indoors for security reasons in many fields, including industry, medicine, and education, and the video produced by the camera needs to be processed off the device [20]. Most images need noise reduction, which is considered a preprocessing stage that prepares the image for extraction of useful data; many noise-reduction algorithms have been suggested in the literature [21] and are used here to prepare the image without further contribution from this work.

HAR is considered an important topic; therefore many surveys have been introduced and published [9, 22]. Some survey papers categorize methods by how they classify or cluster activities, while others consider the extracted features and their types; most of them treat the HAR system as a large problem, since a HAR system is complex and includes many necessary stages. The most important studies in the literature consider the applications of HAR systems and the different fields of life in which they can be applied [23].

A HAR system consists of three main stages: a preparation stage, a feature extraction stage, and a classification stage. Deep learning is widespread nowadays and has become necessary because of its use in many modern applications. A HAR system built on deep learning algorithms, like any learning system, has to teach the machine to take over human tasks and aims to use the computational resources optimally. Prediction of future results is needed because the parameters of such systems have grown and must be controlled by deep learning, as in traffic flow congestion [24, 25]. Automated machine behavior has become valuable worldwide, so machine learning is important here, coinciding with the economic development of some countries such as China [26].

2. Literature Review

Many researchers have proposed different techniques in the literature; we present the most important studies in the field of machine learning. Complex features have been extracted using end-to-end machine learning and multimodal temporal fusion on smartphones [27]. A HAR method based on wearable sensors was suggested in [28], using a stacked autoencoder to extract powerful features from the human body. Mobile edge computing has been applied to video processed frame by frame as images, with multifunctional sensors simulated by a smartphone to gather information from the human body and recognize activities for medical purposes [29]. Monocular images extracted from video were used with 3D and 2D poses to verify which activity was performed in [30]; the authors used two high-level parameters in still and video images. A random forest model was presented to enhance deep learning, and 40 activities were recognized with good HAR system performance [31].

3. HAR System in Machine Learning

Acquired data are collected from various sources, such as embedded and vision sensors, and are analyzed in the preprocessing stage. Information coming from devices is classified into two categories: data from vision sensors such as cameras and data from wearable sensors. Preprocessing differs between the two categories: data from sensors need noise reduction and enhancement, while segmentation is required for the data that come from the camera [13]. The simple architecture illustrated in Figure 3 shows the main data entering the system with the related preprocessing actions.

3.1. Collecting Data

Information handled by a HAR system falls into two categories: the first is sensor-based HAR and the second is vision-based HAR. One can predefine a set of n activities or actions A, which can be noted as

A = {A0, A1, …, An−1},

and multiple groups of sensors are used to measure a number of attributes S over k time series during the interval I = [ta, tb], so these measurements can be written as

S = {S0, S1, …, Sk−1}.

The purpose of the HAR system is to find a temporal partition of I, such as [I0, I1,…, Ir−1], together with the list of attributes S; a collection of classes describes the activities during each time partition Ik, and the collection of time partitions can be defined as

I = I0 ∪ I1 ∪ … ∪ Ir−1, where consecutive partitions do not overlap.
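To make this formalization concrete, the following minimal Python sketch (with made-up sample counts and a hypothetical single-attribute signal) partitions a labeled sensor time series recorded over I into r non-overlapping windows and reports the dominant activity in each partition.

```python
import numpy as np

# Hypothetical single-attribute signal sampled over the interval I = [ta, tb]:
# 600 samples, with a ground-truth activity id (0..5) attached to each sample.
signal = np.random.rand(600)
labels = np.repeat(np.arange(6), 100)

r = 12  # number of temporal partitions I0, I1, ..., I(r-1)
windows = np.array_split(signal, r)        # non-overlapping partitions of I
label_chunks = np.array_split(labels, r)

for k, (window, chunk) in enumerate(zip(windows, label_chunks)):
    dominant = np.bincount(chunk).argmax() # majority activity inside Ik
    print(f"I{k}: {len(window)} samples, dominant activity A{dominant}")
```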

Vision-based HAR depends on sensing device technology such as CCTV and cameras with recording capabilities to record the activities of the human [32]. This approach does not need wearable sensors or smartphones but depends mainly on the quality of the image taken from the camera, which is determined by image resolution, illumination change, and environmental lighting. The collected data behave like a sequence of images, an audio signal, or computer vision input. After collection, the data pass through processes such as feature extraction, modeling with segmentation of the activities, and then activity classification and tracking.

3.2. Machine or Deep Learning Based HAR

Signal processing is a technique that has long been used to analyze data collected from sensors [33]. This approach performs preprocessing and extracts engineered features from video or images; the extracted features are then used to train a machine learning (ML) algorithm for classification to decide the activity [26]. An easy classification method is logistic regression, which models the class probability in the same way as a classification model:

P(y = 1 | x) = 1 / (1 + e^(−(w·x + b))),

where w is the weight vector, b is the bias, and x is the feature vector.
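As an illustration of the logistic model mentioned above, the following sketch evaluates the class probability with NumPy; the weight values are made up for demonstration and are not learned from any HAR dataset.

```python
import numpy as np

def logistic_probability(x, w, b):
    """P(y = 1 | x) = 1 / (1 + exp(-(w.x + b))) -- the logistic model described above."""
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))

# Made-up weights and one 3-dimensional feature vector, for demonstration only.
w = np.array([0.8, -0.4, 1.2])
b = -0.1
x = np.array([0.5, 0.2, 0.9])
print(logistic_probability(x, w, b))   # probability that x belongs to the positive class
```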

Feature engineering is analyzed manually to select a suitable set of features, which are then used to reduce the feature space. Deep learning has become an interesting research topic in the past decades and now rivals human performance in different research areas, including HAR systems. In traditional ML, normal feature extraction is performed in the HAR system, but with large datasets and real-time requirements the HAR system needs to be improved in efficiency. The different terms and characteristics of machine learning and deep learning are presented in Table 1. The proposed HAR system relies on machine learning, whose conditions and requirements are met [25].

Other classifiers are used in HAR systems, such as naïve Bayes, but its ability to handle a growing number of features is not good, and the classifier itself is difficult to modify for new types of features. The decision tree classifier is not able to adapt to deep learning methods, because deep learning changes the process while the system is running and adaptive processing is needed. The biggest issues with classifiers are feature extraction and the kind of processing these features require.

3.3. Learning HAR System

There has been exponential growth in the use of HAR systems in recent years; to make data processing more straightforward and efficient, learning algorithms have been improved and developed, involving data collection and analysis, pattern recognition, and useful features that are fed to the ML algorithm [34].

Many algorithms have been introduced in the ML literature, and they are generally classified into the following categories:
(1) Classification algorithms, represented by naïve Bayes and decision tree algorithms
(2) Rule extraction algorithms, which deal with prior knowledge
(3) Clustering algorithms, such as the K-means and EM algorithms
(4) Support vector machines, defined as the SVM algorithm and its improvements
(5) Neural networks with hidden layers
(6) Genetic algorithms

An important stage of HAR classification in this article is the use of an improved SVM classifier to classify the activities. One advantage of the SVM classifier is that it is flexible and extensible [35]. Support vector classification takes training vectors xi ∈ R^n, i = 1,…, N, for two categories or classes, together with a label vector y ∈ {1, −1}^N, so we can define the primal problem:

minimize over w, b, ξ: (1/2)·wᵀw + C·Σi ξi, subject to yi(wᵀφ(xi) + b) ≥ 1 − ξi and ξi ≥ 0.

Each label satisfies yi ∈ {1, −1}, training is performed on the vectors xi, and φ maps xi into a higher-dimensional feature space.

To improve the SVM classifier, one first needs to adjust the factors of the classifier and find a new path so that the algorithm can be adapted to give the required results. There are linear and nonlinear classifiers; a simple linear SVM is defined on samples (xi, yi), where i = 1,…, N, xi = (xi1, xi2,…, xin) is the attribute vector of the i-th sample, and yi ∈ {−1, +1} is the class label; the decision is then

w·xi + b ≥ +1 if yi = +1, and w·xi + b ≤ −1 if yi = −1.

Since w is the weight vector and b is the bias, when the training data are linearly separable, the separating hyperplane (w, b) is defined as

w·x + b = 0,

and the function of the linear classifier is defined as follows:

f(x) = sign(w·x + b).
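A minimal sketch of this linear decision function, with toy (not learned) values of w and b:

```python
import numpy as np

def linear_svm_decision(x, w, b):
    """Linear SVM decision f(x) = sign(w.x + b): +1 for one class, -1 for the other."""
    return np.sign(np.dot(w, x) + b)

# Toy values, not learned from any dataset.
w = np.array([1.5, -2.0])
b = 0.25
print(linear_svm_decision(np.array([0.4, 0.1]), w, b))   # -> 1.0 or -1.0
```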

So, during training on a given dataset, the margin of the i-th sample xi with respect to the hyperplane (w, b) can be defined as

γi = yi(w·xi + b) / ||w||.

The functional margin of an image from the dataset during training builds a decision boundary bounded by two parallel lines; the gap between them is called the margin, and the width of this margin is controlled critically, where the width is defined as in equation (9):

width = 2 / ||w||.

Maximizing the margin makes the classifier more accurate, which is achieved by minimizing ||w||; the classifier therefore has to be optimized as given below.

Control w, b, and ξ by minimizing (1/2)·||w||² + C·Σi ξi, subject to yi(w·xi + b) ≥ 1 − ξi and ξi ≥ 0.

After minimizing, the decision function becomes

f(x) = sign(Σi αi yi (xi·x) + b),

where the coefficients αi are obtained from the optimization.
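The following sketch, which assumes scikit-learn rather than the paper's own implementation, fits a linear SVM on a tiny toy set and reads off w, b, and the margin width 2/||w|| discussed above.

```python
import numpy as np
from sklearn.svm import SVC

# Tiny linearly separable toy set in a 2-D feature space (illustrative values only).
X = np.array([[1.0, 1.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1e6)   # a very large C approximates the hard-margin problem
clf.fit(X, y)

w = clf.coef_[0]                    # learned weight vector w
b = clf.intercept_[0]               # learned bias b
print("w =", w, "b =", b, "margin width 2/||w|| =", 2.0 / np.linalg.norm(w))
```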

3.4. Nonlinear HAR System

In the proposed method, the optimal hyperplane is the one that maximizes generalization. The problem arises when the training data are not linearly separable, which means that running a linear classifier cannot reach a good result, because mapping the feature space to a high dimension does not by itself allow the classifier to perform well. For this reason, the proposed method uses an efficient way to handle the higher-dimensional space, called the kernel trick [36], together with weighted features. Where the linear system depends on a dot product, the classifier is developed so that the inner product is replaced by a kernel function over the training data. In this regard, the proposed method is designed to find a1,…, an in the following equation:

maximize Σi ai − (1/2)·Σi Σj ai aj yi yj K(xi, xj).

Here the maximization is carried out with respect to the factors ai, subject to

Σi ai yi = 0 and 0 ≤ ai ≤ C for all i.

So the classifier can be applied by using the following equation:

f(x) = sign(Σi ai yi K(xi, x) + b).

Given the images provided in the dataset with their activities, we train the system using the improved classifier to classify six activities. The system can handle more activities through further training; we then proceed to testing the system. The weights of the extracted features are also used as a factor of classification.

The vector of weighted features can vary and change between activities x; the weight of a feature in one activity differs from that in the others due to the nature of the weight. In the proposed method, the features used are the angles between the human body and its extremities, such as the arms and legs.
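A hedged sketch of how per-feature weights could be combined with a nonlinear (RBF-kernel) SVM, assuming scikit-learn and synthetic angle-like features; the exact weighting scheme of the proposed method is not reproduced here.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))                       # 60 samples, 4 angle-like features (synthetic)
y = rng.integers(0, 2, size=60)                    # two activity labels, for illustration only

feature_weights = np.array([1.0, 0.5, 2.0, 1.0])   # hypothetical per-feature weights

X_scaled = StandardScaler().fit_transform(X)
X_weighted = X_scaled * feature_weights            # emphasize or de-emphasize features before the kernel

clf = SVC(kernel="rbf", C=10.0, gamma="scale")
clf.fit(X_weighted, y)
print("training accuracy:", clf.score(X_weighted, y))
```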

4. Method and Discussion

Each HAR system based on machine learning has three main stages: preprocessing, feature extraction, and finally classification, which is our main concern. Images taken from different sources, such as a camera or a scanner, are the input to the HAR system. Two datasets obtained from the public domain, INRIA and KTH, are used, covering six activities: walking, running, jogging, boxing, clapping, and waving. Images are normalized in size before recognition rate estimation. The recognition rate can be calculated by the following formula:

Recognition rate (%) = (number of correctly classified samples / total number of samples) × 100.
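A minimal sketch of this computation, with made-up predictions for three test images:

```python
def recognition_rate(predicted, actual):
    """Recognition rate (%) = correctly classified samples / total samples * 100."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)

print(recognition_rate(["walking", "running", "boxing"],
                       ["walking", "running", "waving"]))   # -> 66.66...
```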

Any HAR model consists of stages illustrated in Figure 4.

The system starts with an image coming from a standard dataset which contains more than 1580 activity samples, each with 9 actions. The proposed method analyzes the activities automatically via the designed algorithm to reduce human effort and obtain an instant response.

The schematic of the proposed method is shown in Figure 4 and starts with a preprocessing stage that segments the image. This stage is sometimes called the preparation stage, since it prepares the properties of the given image for the next step; it must be treated as an important stage, because the processing here decides whether the features are extracted properly or not. Any mistake in this process can disrupt the system and prevent accurate results. Some images from the datasets come with noise or artifacts; to remove this noise, a powerful expression such as the one in equation (16) is used:

Id(x, y) = median{I(s, t) : (s, t) ∈ Sxy},

where Sxy represents the neighborhood of pixels of the (M × N) image I around the coordinate (x, y), and Id represents the resulting gray-level value of the denoised image.
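A small sketch of this noise-reduction step, assuming OpenCV and a hypothetical file name; a 3 × 3 median filter is used here as one plausible reading of equation (16), though the paper's exact operator may differ.

```python
import cv2

# Hypothetical file name; the image is read as gray scale and impulse noise is
# suppressed with a 3x3 median filter over the neighborhood of each pixel.
image = cv2.imread("sample_frame.png", cv2.IMREAD_GRAYSCALE)
denoised = cv2.medianBlur(image, 3)
cv2.imwrite("sample_frame_denoised.png", denoised)
```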

Some images have noise and others do not, but the majority are clear images; we apply noise reduction anyway for the integrity of the system. Another task in the preparation step is segmentation; in this process, there are two conditional actions. The first is background subtraction, in which the desired foreground object, in our case the human body, is extracted by subtracting the background from the image to obtain the human shape; this is achieved by applying the expression in equation (17) to the given image.
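A minimal sketch of this background-subtraction step, assuming OpenCV, hypothetical file names, and an assumed gray-level threshold; equation (17) itself is not reproduced here.

```python
import cv2
import numpy as np

# Hypothetical file names: a frame containing a person and the empty background scene.
frame = cv2.imread("frame_with_person.png", cv2.IMREAD_GRAYSCALE).astype(np.int16)
background = cv2.imread("empty_background.png", cv2.IMREAD_GRAYSCALE).astype(np.int16)

threshold = 30                                          # assumed gray-level threshold
foreground_mask = (np.abs(frame - background) > threshold).astype(np.uint8) * 255

cv2.imwrite("foreground_mask.png", foreground_mask)     # white pixels = human silhouette
```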

The other type of segmentation considers part of the image and the relation of the object (the human body) to the image environment; this process varies from one HAR system to another, and most studies in the literature contribute to this stage. The processing in this stage provides the input to the next stage, feature extraction.

Feature extraction gains its importance from the fact that any classification depends on the extracted features, and the classifier cannot reach any result without good feature vectors. Features are extracted from both the object itself and the position of each of its parts. Figure 5 shows how the human body is segmented within a given image.

Information extracted from a segmented image is stored as a vector f(µ) = {µ1, µ2,…, µn}, where each µi is the data extracted from one segment. What applies to one activity automatically applies to the other activities.

Other features, called triangle features, are extracted by applying a triangle scan to all images; the triangle scan is applied flexibly, is resizable, and can be rotated, as shown in Figure 6.

The triangle scans the whole image from top left to bottom right, with 360-degree rotation at each step. Other features are the angles of vertical elevation of the object with respect to the ground horizon line and the angles of the triangle; these vertical elevations are also stored in vectors for later classification with the classifier.
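As an example of such angle features, the following sketch computes the angle between a limb segment and the vertical axis from hypothetical joint coordinates; the joint positions and names are assumptions, not the paper's exact procedure.

```python
import math

def angle_to_vertical(joint_a, joint_b):
    """Angle in degrees between the segment joint_a -> joint_b and the vertical axis."""
    dx = joint_b[0] - joint_a[0]
    dy = joint_b[1] - joint_a[1]
    # atan2(dx, dy) measures the deviation from the downward vertical direction in image coordinates.
    return math.degrees(math.atan2(dx, dy))

# Hypothetical pixel coordinates (x, y) of a hip and an ankle in a segmented image.
hip, ankle = (120, 60), (150, 200)
print(angle_to_vertical(hip, ankle))   # roughly 12 degrees from vertical
```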

With the proposed method, novel features are extracted from the human body and their weights are then used in a nonlinear classifier, namely the improved SVM. The features control the results of the classifier; once certain results are obtained, the arrangement of the features starts to change according to their weights, and the weights are updated during the training iterations of the classifier. During segmentation of the image into subsegments, the system detects whether any part of the human body is present. As human body parts change position during activities, the new positions can be detected accordingly, as shown in Figure 7.

5. Results and Discussion

Two datasets obtained from the public domain, KTH and INRIA, are used to evaluate the proposed method, both covering six activities: walking, running, jumping, boxing, waving, and clapping. Eight scenarios are contained within the activities. The images used in the proposed method were first normalized to 64 × 128 pixels, with the front view considered for waving and clapping and the side view for the remaining activities. These six activities were chosen because of their popularity in still-image datasets.

A nonlinear SVM classifier is used to obtain a highly accurate recognition rate when dynamic features are extracted from the image considered. Some activities, such as walking, running, and jogging, obtain high recognition rates because the extracted angle features are directly affected by these activities, whereas the remaining activities obtain lower recognition rates because of their nature; the features gain less in such activities. The applicability of the proposed method is limited for online recognition, and its computation time while running is longer than that of the other methods. The method can obtain accurate results on still images without online processing by using the proposed rich features. The proposed method obtains a high recognition rate of 87.3%, the proportion of correctly classified samples, which is considered a good result for a HAR system. For still images, this result is considered high because of the lack of reliable prior information; it is not like video, where prior information such as the previous frame can be relied upon, which would of course change the result.

The data in Table 2 form the confusion matrix of the supervised method. The results in the confusion matrix show that some activities, such as walking, running, and jogging, benefit more because they are well suited to the suggested features (angles), so the system is able to recognize them perfectly.
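For reference, a confusion matrix of the kind shown in Table 2 can be computed as in the following sketch, which assumes scikit-learn and uses made-up labels for a handful of test images.

```python
from sklearn.metrics import confusion_matrix

activities = ["walking", "running", "jogging", "boxing", "clapping", "waving"]

# Made-up ground-truth and predicted labels for a handful of test images.
y_true = ["walking", "running", "jogging", "boxing", "clapping", "waving", "running"]
y_pred = ["walking", "running", "jogging", "boxing", "waving",   "waving", "jogging"]

print(confusion_matrix(y_true, y_pred, labels=activities))
```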

For these activities, the feature extraction of the proposed model focuses on the arms and legs with respect to the vertical direction of the body. In contrast, for other activities such as clapping and waving, it focuses on the triangle derived from the arms and legs.

Training the system on the collected datasets for six activities reaches acceptable results with the improved SVM classifier compared with existing methods.

The proposed method allows the system's accuracy to be evaluated for six activities using two datasets, with the evaluation standing on the activities shared between these datasets. The training system achieves a good response time, but delays occur when converting the image to gray scale, followed by noise reduction and segmentation. Training also depends on the number of images in the dataset; training processes samples of known activities identified by the dataset, and when training is finished, testing starts on unknown images according to feature vectors belonging to a particular class. Features are entered into the classifier as the training set

{(xi, yi) | xi ∈ Rn, yi ∈ {−1, 1}},

for all i = 1, 2, …, n.

Each xi is a vector of n real dimensions, and yi indicates whether the feature belongs to class A (yi = 1) or class B (yi = −1), which means that the two classes A and B are categorized as separated data. During machine learning classification, a decision rule is created to classify a given X as belonging to one class. Observation of the training set yields the rule built during training. The prediction stage, also called the testing stage, takes a given image and processes it by applying the machine learned on the training set. The system compares the features extracted from the given image with the features extracted during the training process to find the best match, or the closest counterpart, and thus recognize the activity. The training and testing modes are presented in Figure 8.

A number of the suggested features are powerful features supported by the HAR system. Most studies improve and enhance this part, because it still needs improvement, and the dataset used additionally limits the features that can be used, according to the nature of the object inside the image itself. Using an accurate classifier also contributes to achieving good results, by adapting it to the proposed method.

6. Conclusion

We report a machine learning HAR system that recognizes human activities such as walking, running, jogging, boxing, clapping, and waving. The proposed system consists of three main stages: preprocessing, feature extraction, and classification. Our contribution lies in the last two stages and in their integration into the system. As the machine learning classifier is the heart of the HAR system, improvement in this context yields good results. The newly suggested features, such as the angles derived from triangles and the verticality of the human body with respect to the horizon line, play an important role in the HAR system. The proposed HAR system achieved good results in terms of recognition rate, by classifying the activities, and accuracy, while real-time response was not considered in this study.

Data Availability

Two standard datasets were used in the proposed system, each with more than 890 available images, to recognize six activities. Two types of features were derived from the human body via angles and used to run the SVM classifier. Improving the mathematical formulation of the SVM classifier helped improve the system. The images used in the HAR system were normalized to a scale of 24 × 128 pixels.

Conflicts of Interest

The authors confirm that there are no conflicts of interest regarding the publication of this paper.