Abstract

In the area of recognition and classification of children activities, numerous works have been proposed that make use of different data sources. Most of them use sensors embedded in children’s garments. In this work, environmental sound data are proposed for generating a model for the recognition and classification of children activities through machine learning techniques, optimized for application on mobile devices. First, a genetic algorithm is used for feature selection, reducing the original size of the dataset, an important aspect when working with the limited resources of a mobile device. To evaluate this process, five different classification methods are applied: k-nearest neighbor (k-NN), nearest centroid (NC), artificial neural networks (ANNs), random forest (RF), and recursive partitioning trees (Rpart). Finally, the resulting models are compared on the basis of accuracy in order to identify the classification method with the best performance in identifying children activity from audio signals. According to the results, the best performance is obtained by the five-feature model developed with RF, with an accuracy of 0.92, which allows us to conclude that it is possible to automatically classify children activity from a reduced set of features with significant accuracy.

1. Introduction

Artificial intelligence is an area in which there have been significant advances in recent years, driven largely by the evolution of technology. Within this area, some subareas have developed more than others, as is the case of environmental intelligence [1], through which human beings seek to develop solutions that facilitate the activities of daily life and the interaction with their environment.

In recent years, with the growing development of environmental intelligence, different solutions and applications have emerged focused on providing services in different scenarios such as smart homes [2], smart buildings [3], smart cities [4], and other scenarios where it is possible to automate some type of interaction between people and their environment. Many of the works developed in the area of environmental intelligence are based on the concept of recognition and classification of human activities, through which it is possible to detect the type of activity performed by a person or group of people and offer a specific service or solution. This area of recognition and classification of human activities has had a huge boom in recent years, and many works have been developed [5–10].

Within the works on recognition and classification of human activities, an important aspect is the group of people on whom they are focused, since aspects such as the way the data are obtained and the services offered, among others, depend on it. Some works are directed at the elderly [11, 12], others at children [13, 14], and most at adults interacting with defined environments [10, 15].

Another important aspect to be considered is the data source to be used, and numerous works have been developed using sensors as a data source [16, 17]. In the area of recognition and classification of children activities, common data sources are video cameras, accelerometers, and radio-frequency devices [18, 19]. Most of the works presented in this area capture data by placing the sensors directly on the children’s clothing, which has the disadvantage that these sensors can interfere with children’s natural behavior and alter the very activities that are intended to be analyzed. Also, as mentioned in the work of Wu et al. [20], this approach can be framed within one of the main roles and opportunities of information and communication technologies nowadays, the sustainable development goals, where sustainable data acquisition can be pursued with the purpose of assisting all nations in the development of research for multiple purposes.

The use of a data source that does not interfere with the performance of the activities to be analyzed is an alternative to the use of embedded sensors. Sound has been used as a data source [21, 22] with the advantage that its capture does not interfere with the analyzed activities, because the capture device does not need to be embedded in the clothing and can be at some distance from the place where the activity is taking place. In addition, it is not necessary to use sensors or special hardware, since the microphone of a mobile device can be used directly to capture ambient sound. For a general overview of big data system development, the work of Atat et al. [23] presents a panoramic survey of working with the large amounts of data generated through a network, covering the different stages of data collection, storage, access, processing, and analysis.

To perform the process of recognition and classification of children activities using sound, it is necessary to select the audio features to generate a model that, trained with a part of the dataset, is able to predict to which class a given audio belongs. For the generation of these models, the genetic algorithm, Galgo (R package) [24], is implemented, based on a set of algorithms used as classification methods. The present work implements the k-nearest neighbor (k-NN), nearest centroid (NC), artificial neural networks (ANNs), random forest (RF), and recursive partitioning trees (Rpart) classification methods, which are algorithms commonly used in recognition and classification of human activities [25–29].

The Galgo implementation was selected for its ability to select relevant features that model the behavior of a phenomenon, even though it was developed for big search and bioinformatics applications. This algorithm has proven useful with various types of signals, such as magnetic field [30], complex systems [31], and features extracted from medical images [32, 33], to mention some.

In the work presented by Blanco-Murillo et al. [34], the generation of models for classifying children activities through environmental sound with the aforementioned algorithms is shown. These models were generated with a 34-feature set extracted from the audios, and the classifier that achieves the highest precision is k-NN, with 100%.

Previously, we presented a work where, with a larger dataset and the same set of extracted features, we performed a feature selection process using the Akaike criterion, obtaining a 27-feature subset for the generation of the models with the same classification algorithms mentioned above. These models achieved accuracies greater than 90%, with a 20% reduction in the size of the dataset due to the feature selection process, and the extra trees classifier obtained the highest precision, 0.96.

In the present work, the same audio recordings and the same set of extracted features are used, but a more efficient feature selection process is now performed using the genetic algorithm, Galgo [35], to generate models for the recognition and classification of children activities with the same classification algorithms and a reduced dataset containing only the most significant features of the activities analyzed, seeking a reduction in the size of the original dataset.

Therefore, the main question addressed in this work is the following: is it possible to develop a reduced features model for the classification of children’s activities based on the environmental sound through the implementation of a genetic algorithm?

Because this work is intended for use and implementation on mobile devices, this reduction in dataset size has a positive impact on the processing time and resources of the devices, since they work with a smaller amount of data.

2. Materials and Methods

The methodology of this work consists of four main stages, beginning with a brief data preprocessing, where the audio signals are divided into shorter time intervals. Then, a set of features is extracted from the audio signals and subsequently subjected to feature selection based on a genetic algorithm. To validate this feature selection, five different classification methods are applied.

The feature extraction process is performed using the Python programming language [36], while the feature selection process and the model validation are performed using the R development environment and the Galgo library [24].

2.1. Data Description

The recordings used in the present work are taken from the previous work [37]. These recordings belong to four different activities, shown in Table 1.

2.2. Data Preprocessing

The data preprocessing consists of dividing the audio signals into ten-second time intervals.
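As an illustration of this step, the following minimal Python sketch splits a sequence of audio samples into fixed-length clips. The function name `split_into_clips`, the toy sample rate, and the choice to discard a trailing remainder shorter than one clip are assumptions made for the example; the original work does not describe how such remainders are handled.

```python
def split_into_clips(samples, sample_rate, clip_seconds=10):
    """Split a 1-D sequence of audio samples into fixed-length clips.

    Trailing samples that do not fill a whole clip are discarded
    (an assumption; the source does not say how remainders are treated).
    """
    clip_len = int(sample_rate * clip_seconds)
    if clip_len <= 0:
        raise ValueError("clip length must be positive")
    n_clips = len(samples) // clip_len
    return [samples[i * clip_len:(i + 1) * clip_len] for i in range(n_clips)]


# Toy signal: 25 seconds at a hypothetical 4 Hz sample rate (100 samples).
signal = list(range(100))
clips = split_into_clips(signal, sample_rate=4)
print(len(clips), [len(c) for c in clips])
```

With a real recording, `samples` would be the decoded waveform and `sample_rate` its sampling frequency in Hz.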

2.3. Feature Extraction

For the feature extraction, a set of numerical features is extracted from the ten-second intervals of the audio signals, as shown in Table 2.

These 34 extracted features are those commonly used in audio analysis and in activity recognition through audio [38–41], especially the mel-frequency cepstral coefficients (MFCCs), since many audio analysis works have been developed that make use of them [21, 42–44].

2.4. Feature Selection

The objective of the feature selection is to reduce the number of features by identifying those that present the most significant behavior in the classification of the audios. For this stage, the genetic algorithm (GA), “Galgo” [24], is implemented, using five different classification methods, in order to obtain subsets of five features. These classification methods consist of k-NN, NC, RF, ANN, and Rpart.

Galgo is a package developed in the R language [45], implemented with an object-oriented approach and based on a general fitness function that guides the feature selection. The procedure starts with a random population of feature subsets, known as chromosomes. Each chromosome is evaluated by measuring its ability to predict an output or dependent feature, based on the accuracy. A classification method is included for this evaluation. The main idea is to replace the initial population with a new one that includes features from the different chromosomes that present a higher classification accuracy. This process is repeated enough times to achieve a desired level of accuracy. The progressive improvement of the chromosome population is performed by a process inspired by natural selection and based on three principles: selection, mutation, and crossover.

In summary, the application of Galgo consists of four main stages:
(1) Setting-Up the Analysis. In this stage, the parameters of the algorithm are set, including the input data, the outcome, the statistical model, the desired accuracy, the error estimation scheme, and the classification method, among others.
(2) Searching for Relevant Multivariate Models. This stage consists of the selection process which, beginning with the random population of chromosomes and based on a classification method, looks for the best local solutions.
(3) Refinement and Analysis of the Local Solutions. The selected chromosomes are then subjected to a backward selection strategy since, even when these chromosomes present the best accuracy, they could include features that do not contribute significantly to the fitness value. The objective of this strategy is to derive a chromosome population containing only features that effectively contribute to the classification accuracy.
(4) Development of a Final Statistical Model. Finally, a single representative model is obtained based on a forward selection strategy, where, following a stepwise inclusion, the most frequent genes in the chromosome population are selected.
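The search stage described above can be sketched in a few lines of Python. This is not Galgo itself, which is an R package with its own operators, niches, and error estimation scheme. It is a simplified illustration under stated assumptions: truncation selection that keeps the better half of the population, single-point crossover, a fixed mutation rate, and a fitness defined as the training accuracy of a nearest-centroid rule. All function names and the synthetic data are hypothetical.

```python
import random


def nc_accuracy(X, y, feats):
    """Fitness: training accuracy of a nearest-centroid rule on the chosen features."""
    classes = sorted(set(y))
    cents = {}
    for c in classes:
        rows = [x for x, lab in zip(X, y) if lab == c]
        cents[c] = [sum(r[f] for r in rows) / len(rows) for f in feats]
    hits = 0
    for x, lab in zip(X, y):
        pred = min(classes, key=lambda c: sum((x[f] - m) ** 2
                                              for f, m in zip(feats, cents[c])))
        hits += pred == lab
    return hits / len(y)


def crossover(a, b, rng, n_feats):
    """Single-point crossover; duplicated genes are replaced by unused features."""
    cut = rng.randrange(1, len(a))
    child = list(dict.fromkeys(a[:cut] + b[cut:]))  # drop duplicates, keep order
    unused = [f for f in range(n_feats) if f not in child]
    rng.shuffle(unused)
    return child + unused[:len(a) - len(child)]


def mutate(chrom, rng, n_feats, rate=0.2):
    """With probability `rate`, swap one gene for a feature not in the chromosome."""
    chrom = list(chrom)
    if rng.random() < rate:
        unused = [f for f in range(n_feats) if f not in chrom]
        chrom[rng.randrange(len(chrom))] = rng.choice(unused)
    return chrom


def ga_select(X, y, size=5, pop_size=20, generations=30, seed=0):
    rng = random.Random(seed)
    n_feats = len(X[0])
    # Random initial population of chromosomes (feature subsets).
    pop = [rng.sample(range(n_feats), size) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda ch: nc_accuracy(X, y, ch), reverse=True)
        parents = pop[:pop_size // 2]  # selection: keep the fitter half
        children = [mutate(crossover(*rng.sample(parents, 2), rng, n_feats),
                           rng, n_feats)
                    for _ in range(pop_size - len(parents))]
        pop = parents + children
    best = max(pop, key=lambda ch: nc_accuracy(X, y, ch))
    return sorted(best), nc_accuracy(X, y, best)


# Synthetic data: 12 features, of which only feature 3 separates the two classes.
data_rng = random.Random(1)
X, y = [], []
for label in (0, 1):
    for _ in range(40):
        row = [data_rng.gauss(0, 1) for _ in range(12)]
        row[3] += 4 * label
        X.append(row)
        y.append(label)

best, acc = ga_select(X, y)
print(best, acc)
```

The chromosome size of five mirrors the value used in this work; the population size and number of generations are kept small only so the sketch runs quickly.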

2.4.1. Classification Methods

(1) k-NN. This method has been widely used in different statistical applications. It is a nonparametric approach based on identifying, within a set of training data, the group of k samples nearest to an unknown sample. To do so, the Euclidean distance between a given query and each input, defined as $d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$ for two n-dimensional feature vectors $\mathbf{x}$ and $\mathbf{y}$, is calculated, identifying the k closest input points for each query. Then, the output of the unknown sample is determined by averaging over these k nearest samples [46].
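A minimal Python sketch of this idea follows. It assumes a majority vote among the k nearest neighbors, which is the usual choice for class labels, and it uses toy two-dimensional data with hypothetical labels `activity_a` and `activity_b` that do not come from the dataset of this work.

```python
import math
from collections import Counter


def euclidean(x, y):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def knn_predict(train_X, train_y, query, k=3):
    """Label a query by majority vote among its k nearest training samples."""
    nearest = sorted(zip(train_X, train_y),
                     key=lambda pair: euclidean(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy training data (hypothetical labels, not the activities of the paper).
train_X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
           [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]]
train_y = ["activity_a"] * 3 + ["activity_b"] * 3

prediction = knn_predict(train_X, train_y, [0.15, 0.1])
print(prediction)
```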

(2) NC. This method follows a partitional clustering approach, where data objects are grouped into k clusters depending on their similarity. The parameter k must be specified initially. The first selection of centroids is done randomly. To measure the closeness between a centroid and a data object, the Euclidean distance or the cosine similarity is calculated. After this first grouping, a new centroid is obtained for each cluster, together with the distance from each data object to each center, and the data points are reassigned according to that distance. A point in a cluster is treated as the centroid if the sum of the distances between this point and all the objects of the cluster achieves the optimal minimization. The main objective of NC is to minimize the sum of distances between the objects of a cluster and its centroid [47].
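As a hedged illustration, the sketch below shows the classification use of centroids: one centroid is computed per labeled class as the mean of its training vectors, and a query is assigned to the class with the closest centroid under the Euclidean distance. It does not reproduce the iterative reassignment described above, and the function names and toy data are hypothetical.

```python
import math


def fit_centroids(X, y):
    """Compute one centroid (feature-wise mean) per class label."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for i, value in enumerate(row):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(centroids, query):
    """Assign the query to the class whose centroid is closest (Euclidean)."""
    return min(centroids, key=lambda label: math.dist(centroids[label], query))


# Toy data with hypothetical labels.
X = [[0, 0], [2, 0], [10, 10], [12, 10]]
y = ["a", "a", "b", "b"]
centroids = fit_centroids(X, y)
print(centroids, predict(centroids, [3, 1]))
```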

(3) RF. This classifier was developed by Breiman et al. [48]. It relies on two levels of randomness in the construction of the trees. The first is bagging: each tree is trained on a bootstrapped version of the training data, that is, a subset drawn with replacement, while the remaining data are used to estimate the error by calculating the out-of-bag (OOB) error. The second level acts at the nodes: a subset of features is randomly selected at each node throughout the growth of the decision trees, and the best feature in that subset is chosen, seeking to reduce the label error. This process is repeated recursively until a defined depth is reached or the number of samples in a node no longer exceeds a threshold. Classification is based on the majority vote of all the decision trees [49, 50].
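The two sources of randomness and the voting rule can be illustrated with the short Python sketch below. It is not a complete random forest, since no trees are grown; it only shows bootstrap sampling with the resulting out-of-bag set, a random feature subset for a node, and the majority vote. The helper names and the subset size of five out of 34 features are assumptions chosen for the example.

```python
import random
from collections import Counter


def bootstrap_indices(n, rng):
    """Draw n indices with replacement; the indices never drawn are out-of-bag."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    oob = sorted(set(range(n)) - set(in_bag))
    return in_bag, oob


def random_feature_subset(n_features, m, rng):
    """Pick m candidate features (without replacement) for one node split."""
    return sorted(rng.sample(range(n_features), m))


def majority_vote(predictions):
    """Combine the per-tree predictions into one label."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)
in_bag, oob = bootstrap_indices(10, rng)
candidates = random_feature_subset(34, 5, rng)
print(in_bag, oob, candidates, majority_vote(["a", "b", "a"]))
```

In a full implementation, each tree would be fitted on its `in_bag` samples and scored on its `oob` samples to obtain the OOB error.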

(4) ANN. This method learns a specific task based on the correlation between features. Its learning or training process resembles the behavior of biological neural networks. Through a series of layers containing nodes, or neurons, ANNs try to find a relational model between the input features and the outcome. Three main elements are present in this method: (1) a set of synapses or connections, each characterized by a “weight,” where the input signal reaching a neuron through a connection is multiplied by the weight of that connection; (2) an adder, which aggregates the input signals weighted by their respective weights; and (3) an activation function, equivalent to a transfer function, which limits the amplitude of the neuron’s output to a permissible range of finite values [51].
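The three elements listed above can be made concrete with a single artificial neuron. The sketch below is illustrative only: the sigmoid is one common activation function and is an assumption here, as are the bias term and the example weights.

```python
import math


def neuron(inputs, weights, bias=0.0):
    """Forward pass of one neuron: weighted connections, adder, activation."""
    # Synapses and adder: each input is multiplied by its weight, then summed.
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    # Activation: the sigmoid limits the output to the open interval (0, 1).
    return 1.0 / (1.0 + math.exp(-z))


output = neuron([1.0, 2.0], [0.5, -0.25], bias=0.1)
print(output)
```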

(5) Rpart. This is a statistical method for multivariate analysis that separates samples into different homogeneous risk groups to determine predictors of survival. The algorithm selects the predictor that provides the optimal split, such that each of the resulting subgroups is more homogeneous with respect to the outcome. These subgroups are in turn dichotomized into smaller and more homogeneous subgroups according to the feature that provides the best split for each subgroup. This iterative process continues until too few values remain for additional splits. A pruning process is then applied to the original partitioning tree in order to cut it back to the point where the overall predictive accuracy is maximized, preventing overfitting [52].
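The search for an optimal split on a single predictor can be sketched as follows. The sketch assumes the Gini impurity as the homogeneity measure, which the text above does not specify, and it evaluates candidate thresholds at the midpoints between consecutive sorted values. The function names and toy data are hypothetical, and pruning is not shown.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a group of labels (0.0 for a pure or empty group)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values()) if n else 0.0


def best_split(values, labels):
    """Return (threshold, weighted impurity) of the best binary split on one feature."""
    pairs = sorted(zip(values, labels))
    best = (None, gini(labels))  # no split: impurity of the whole group
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # identical values cannot be separated
        thr = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for v, lab in pairs if v <= thr]
        right = [lab for v, lab in pairs if v > thr]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (thr, score)
    return best


values = [0.1, 0.2, 0.3, 0.8, 0.9]
labels = ["a", "a", "a", "b", "b"]
threshold, impurity = best_split(values, labels)
print(threshold, impurity)
```

A recursive partitioning tree would repeat this search over every feature and then recurse into the two resulting subgroups.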

3. Results

Table 3 shows the number of recordings for each activity analyzed in the present work, as well as the number of ten-second clips generated after dividing the audios.

To perform the analysis, 34 features are extracted from each of the ten-second clips, yielding a dataset of 2716 samples.

Table 4 shows the general parameters of the genetic algorithm used in Galgo.

It is important to mention that the chromosome size was kept at the Galgo default value of five for two reasons. First, because the dataset is not very large, the model needed to be reduced to a size that would not compromise the performance of the classification and that would avoid randomness in the development of the final model, in order to test the robustness of the feature selection. Second, five is the value that Galgo assigns by default when the user does not specify the chromosome size because, according to its description in the literature, this is the chromosome size that has shown the best performance for feature selection across different approaches. The same reasoning applies to the selection of the number of niches [53].

Figures 1–5 show the evolution of the fitness over the 200 generations for the classification methods k-NN, NC, ANN, RF, and Rpart, respectively. From these graphs, it can be observed that in Figure 4, which corresponds to RF, the fitness (represented on the Y axis) has the highest accuracy from the beginning of the process and remains stable throughout the evolutionary process of the GA.

Figures 6–15 show the frequency of appearance of the features in the 300 models developed for the k-NN, NC, ANN, RF, and Rpart classification methods, respectively. The features shown in black are the five with the highest frequency of appearance and are therefore selected as the most representative. In these figures, low-computational-cost algorithms (such as k-NN, Rpart, and NC) reach a stable frequency (shown on the Y axis) for the selected features, while complex models, such as ANN and RF, tend to rank the genes according to their capability to describe the phenomenon.
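The frequency ranking behind these figures amounts to counting how often each feature appears across the evolved models and keeping the most frequent ones. The Python sketch below illustrates this with four toy models; the model contents are invented for the example, `feature_x` is a placeholder name, and the real analysis uses the 300 models per classifier described above.

```python
from collections import Counter


def most_frequent_features(models, top=5):
    """Count in how many models each feature appears and return the top ones."""
    counts = Counter(f for model in models for f in set(model))
    return [name for name, _ in counts.most_common(top)], counts


# Toy models (hypothetical contents, for illustration only).
models = [
    ["energy", "MFCC2", "feature_x"],
    ["energy", "MFCC3", "MFCC2"],
    ["energy", "spectral_rolloff", "MFCC2"],
    ["energy", "MFCC3", "feature_x"],
]
top_two, counts = most_frequent_features(models, top=2)
print(top_two, counts["energy"], counts["MFCC2"])
```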

Based on Figures 6, 8, 10, 12, and 14, Table 5 shows the five-feature subset determined by Galgo for each of the classifiers analyzed. In each case, the subset resulting from the feature selection process yields the accuracies shown in Table 6, where it can be seen that all the classifiers achieve an average accuracy higher than 0.81, with RF achieving the highest accuracy, 0.92.

From Table 5, it can be observed that some of the features selected using the RF classifier also appear in other models. These features are energy, spectral spread, spectral rolloff, and MFCC2, while the MFCC3 feature appears only in the model obtained with RF.

Table 7 shows the accuracies obtained by each classification method through the genetic algorithm, in comparison with the accuracies obtained in the previous work on feature selection by the Akaike criterion [37] and with those of classical feature selection methods such as forward selection and backward elimination. Table 7 also compares the results with the accuracies obtained by the models generated with the complete set of features. It can be observed that the feature selection process through the genetic algorithm, Galgo, performed in this work decreases the size of the original dataset by about 85% (from 34 features to 5), preserving the accuracy in the classification of the audios above 0.85 for four of the five classification techniques analyzed.

4. Discussion and Conclusions

The aim of the present work is to determine a reduced set of features, through a selection process based on a genetic algorithm and using different classification techniques, that allows the generation of a model for classifying a set of children activities based on environmental sound. Table 7 compares the results obtained in the present work with those reported in previous works and with those obtained by classical feature selection techniques. These results show that when no feature selection technique is used (generating the models with the complete dataset), the k-NN classifier achieves 100% accuracy, but the RF classifier achieves an accuracy of only 76%, from which it can be concluded that using all the features does not ensure a perfect classification. In addition, k-NN seems to be prone to overfitting with the 34 features, since some of them provide irrelevant or redundant information for the classification, and the use of more data directly increases the processing times.

When analyzing the models that implement feature selection, the 27-feature models resulting from selection with the Akaike criterion raise the classification accuracy of the RF classifier to 95%, which means that this classifier benefited from feature selection, which eliminated redundant data and other information that affected the classification. For the k-NN classifier, the accuracy is maintained above 94% while the amount of data used is reduced by 20%.

The classical feature selection methods shown in the results, forward selection and backward elimination, give similar results when models with 30 features are generated, reaching accuracies greater than 88%. When the number of features is reduced to 5 using the same methods, the accuracies are above 78%, which represents a considerable reduction in the accuracy of the classifiers, although with the benefit of a reduction in the amount of data used.

Therefore, the GA approach achieves an important feature reduction, retaining only about 15% of the original features (5 of 34), while reaching higher accuracy than the models created with more features (30 or 34) and almost the same as the models with 27 features (Akaike criterion). For mobile devices, this means lower processing power consumption with almost the same accuracy as much more complex models.
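For reference, the proportions discussed here follow directly from the feature counts. The worked arithmetic below measures dataset size only by the number of feature columns, which is an assumption of this illustration:

```latex
\[
\frac{5}{34} \approx 0.147, \qquad
1 - \frac{5}{34} = \frac{29}{34} \approx 0.853, \qquad
1 - \frac{27}{34} = \frac{7}{34} \approx 0.206 .
\]
```

That is, the five-feature models keep roughly 15% of the original features, a reduction of roughly 85%, while the 27-feature models obtained with the Akaike criterion correspond to a reduction of roughly 20%.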

From the GA behavior shown in Figures 1–15, several interesting points should be mentioned:
(i) Figures 1, 2, and 5, which correspond to the evolutionary process of k-NN, NC, and Rpart, present a particular behavior. During the mutation process, these low-computational-cost algorithms include features that decrease the fitness of the model, and these features are then excluded from the selection almost immediately.
(ii) Figures 7, 9, and 15 underline the behavior described above, showing an almost equivalent frequency for the finally selected features of each of the algorithms.
(iii) In contrast, Figures 3 and 4 show that algorithms such as RF and ANN have a smooth evolutionary process, given their capability to find more complex relationships between features.

Hence, from the results obtained in Section 3, it can be concluded that the five-feature sets determined by the genetic algorithm achieve accuracies similar to those of the models generated using the original set of features, but with the advantage of a reduction of about 85% in the dataset size, including all its implications, such as faster data processing, which is very important when working with applications for mobile devices since their resources are limited.

5. Future Work

The present work allows us to demonstrate that it is possible to generate children activity classification models using environmental sound by applying feature selection through genetic algorithms, preserving accuracies greater than 81% and reducing the set of data used by 85%; however, some aspects can be worked on in the future. The proposed future work is as follows:
(i) To work with a different set of features. To extract a larger feature set from the audio files in order to verify the behavior of the feature selection methods analyzed here with a different set of features and to contrast the results with those obtained in this work, to finally determine the subset of features that best describes the activities.
(ii) To combine feature selection methods. To generate classification models using subsets of features resulting from the combination of the method proposed in the present work with classical feature selection methods such as forward selection and backward elimination.

Data Availability

Open data files are available at https://ingsoftware.reduaz.mx/amidami

Conflicts of Interest

The authors declare that they have no conflicts of interest.