Abstract

Mobile applications have become indispensable on users' smart devices, and many of these applications rely on the devices' sensors to achieve their goals. Nevertheless, it remains largely unknown to the user to what extent the data these applications use can be relied upon and, therefore, to what extent the output of a given application is trustworthy. To help developers and researchers, and to provide a common ground of data validation algorithms and techniques, this paper presents a review of the most commonly used data validation algorithms, along with their usage scenarios, and proposes a classification for these algorithms. This paper also discusses the process of achieving statistical significance and trust in the desired output.

1. Introduction

There has been an increase in the number of mobile applications that use sensors to achieve a plethora of goals. Many of these applications are designed and developed by amateur programmers, which in itself is positive, as it reflects a growth in the overall skill set of the developer community. Nevertheless, even when the applications are developed by professionals or by companies, few applications publicize or disclose how the sensors' data is processed. This is a problem, particularly when these applications are meant to be used in scenarios where they can influence their users' lives, for example, when the data is expected to be used to identify Activities of Daily Living (ADLs) or, at an extreme, when the applications are used in medical scenarios.

Due to the nature of the mobile device itself, a multiprocessing platform with limited computational power and limited battery life, the data collected from the sensors is often unusable in its raw form, requiring further processing before it can be representative of the event or object it is supposed to measure. The recording of sensor data and the subsequent processing of this data need to include validation subtasks that guarantee that the data is suitable to be fed into the higher-level algorithms.

Moreover, the use of the sensors' data to feed higher-level algorithms needs to guarantee a minimal degree of error, with this error being the difference between the output of these applications, built on computationally limited mobile platforms, and the output of a gold standard. To achieve this, statistical methods need to be applied to ensure that the output of the mobile application is, to the maximum extent possible, similar to the output given by the relevant gold standard, if and when such a comparison is possible.

To mitigate this problem, this paper presents and discusses the most commonly used data validation algorithms and techniques and their usage in mobile applications that rely on the sensors' data to give meaningful output to their users. The algorithms are listed, and their use is discussed, along with the statistical process needed to ensure maximum reliability of the results.

The remainder of this paper is organized as follows: this paragraph concludes Section 1, where a short introduction to the problem and a proposal for its mitigation are presented; Section 2 presents the most commonly found data validation methods, along with a critical comparison of their usage scenarios; Section 3 deepens the analysis by presenting a classification of the data validation methods; Section 4 discusses the applicability of these methods, including a discussion of the degree of trust the data can be expected to provide; finally, Section 5 presents relevant conclusions.

2. Data Validation Methods

Sensor data validation is an important process executed in the data acquisition and data processing modules of a multisensor mobile system. This process consists of the validation of the external conditions of the data and of the validity of the data for a specific purpose, in order to obtain accurate and reliable results. Validation may be applied not only during data acquisition but also during data processing, as it increases the degree of confidence in the systems, with confidence in the output being of great importance, especially for systems involved in medical diagnosis, but also for the identification of ADLs or for sports monitoring.

In addition, data validation methods must be used during the different phases of the conception of a new system, such as design, development, testing, and validation. Therefore, the data validation methods whose reliability was verified during conception should also be used to validate the data automatically at execution time.

One of the causes for the presence of incorrect values during the data acquisition process is the existence of environmental noise: even when the data is correctly collected, it may still be incorrect because of noise. Therefore, the captured or processed data very often has to be cleaned, treated, or imputed to obtain better and more reliable results. Missing values at random instants of time may, in turn, be caused by mechanical problems or power failures of the sensors. In this case, data correction methods should be applied, including data imputation and data cleaning. The data validation process may be simplified as presented in Figure 1.
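
As a minimal illustration of this cleaning and imputation step, the following sketch (in Python with NumPy; the range limits are hypothetical) drops out-of-range readings and fills missing samples by linear interpolation:

```python
import numpy as np

def clean_and_impute(samples, lo=-40.0, hi=40.0):
    """Mark out-of-range readings as missing, then impute them.

    `samples` is a 1-D array of sensor readings; NaN marks a missing
    value. `lo` and `hi` are illustrative physical-range limits.
    """
    data = np.asarray(samples, dtype=float).copy()
    # Physical range check: readings outside [lo, hi] become missing.
    data[(data < lo) | (data > hi)] = np.nan
    missing = np.isnan(data)
    if missing.all():
        raise ValueError("no valid samples left to interpolate from")
    # Linear interpolation over the valid samples fills the gaps.
    idx = np.arange(data.size)
    data[missing] = np.interp(idx[missing], idx[~missing], data[~missing])
    return data

print(clean_and_impute([1.0, 2.0, np.nan, 99.0, 5.0]))
# -> [1. 2. 3. 4. 5.]
```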

The selection of the best technique for sensor data validation also depends on the type of data collected, the purpose of its application, and the computational platform where the algorithm will run. Data validation techniques are commonly composed of statistical methods. Due to the characteristics of mobile devices, data validation techniques can be executed locally on the mobile device or at the server side, depending on the amount of data to validate simultaneously, the frequency of the validation tasks, and the computational, communication, and storage resources needed for the validation. The characteristics of the sensors are also important for the selection of the best techniques and may be separated into three large groups: sensor performance characteristics, pervasive metrics, and environmental characteristics [1].

While data validation is important for improving the reliability of a system, that reliability also depends on other factors, such as power instability, temperature changes, out-of-range data, internal and external noise, and synchronization problems that occur when multiple sensors are integrated into a system [2]. The reconstruction and correction of the data towards the correct measurement is also important, and several research studies have proposed systems, methods, models, and frameworks to improve data validation and reconstruction [3, 4].

Sensor data validation methods can be separated into three large groups: faulty data detection methods, data correction methods, and other assisting techniques or tools [5].

Firstly, faulty data detection methods may be either simple test-based methods or physical or mathematical model-based methods, and they classify the data as valid or as invalid or missing [6, 7]. For the detection of faulty data, the authors in [7] presented an ordered sequence of methods that should be applied to obtain better results, as follows: zero value detection, flat line detection, minimum and maximum values detection, minimum and maximum thresholds based on last values, statistical tests that follow certain distributions, multivariate statistical tests, artificial neural networks (ANNs) [8], one-class support vector machines (SVMs) [9], and classification and physical models. On the one hand, simple test-based methods include different techniques, such as the physical range check, local realistic range detection, detection of gaps in the data, constant value detection, the signals' gradient test, the tolerance band method, and material redundancy detection [7, 10, 11]. On the other hand, physical or mathematical model-based methods include the extreme value check using statistics, drift detection by exponentially weighted moving averages, the spatial consistency method, the analytical redundancy method, gross error detection, the multivariate statistical test using Principal Component Analysis (PCA), and data mining methods [7, 12, 13].
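
To make the simple test-based checks concrete, the sketch below (Python; the range limits, window length, and tolerance are hypothetical) implements three of the checks listed above: zero value detection, a physical range check, and flat line detection:

```python
import numpy as np

def detect_faulty(samples, lo=-40.0, hi=40.0, flat_window=5, eps=1e-9):
    """Flag suspect readings with three simple test-based checks.

    Returns a boolean mask, True where a sample fails a check.
    `lo`/`hi`, `flat_window`, and `eps` are illustrative parameters.
    """
    x = np.asarray(samples, dtype=float)
    faulty = np.zeros(x.size, dtype=bool)
    # Zero value detection: exact zeros often signal a dead channel.
    faulty |= np.abs(x) < eps
    # Physical range check: readings outside the sensor's range.
    faulty |= (x < lo) | (x > hi)
    # Flat line detection: `flat_window` consecutive identical values.
    for i in range(x.size - flat_window + 1):
        if np.ptp(x[i : i + flat_window]) < eps:
            faulty[i : i + flat_window] = True
    return faulty

print(detect_faulty([3.1, 3.1, 3.1, 3.1, 3.1, 7.2, 0.0, 55.0]))
# -> [ True  True  True  True  True False  True  True]
```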

Secondly, data correction methods may be carried out by interpolation, smoothing, data mining, and data reconciliation [10, 12, 14]. For interpolation, the authors of [11] proposed using the value of the last measurement or the trend from previous sets of measurements. Smoothing methods, for example, the moving average and the median, may be used to filter out random noise and convert the data into a smooth curve that is relatively unbiased by outliers [10]. The application of data mining techniques allows the replacement of faulty values by measurements estimated with several methods, for example, ANNs [14]. Data reconciliation methods, for example, PCA, are used to calculate a minimal correction to the measured variables, subject to the several constraints of the model [13].
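
As a simple illustration, the sketch below (Python; the window length is a hypothetical choice) applies the two smoothing filters mentioned above, a moving average and a moving median, to a noisy series; the median variant is the one that stays relatively unbiased by an isolated outlier:

```python
import numpy as np

def moving_average(x, window=3):
    """Centered moving average; the ends keep their original values."""
    x = np.asarray(x, dtype=float)
    smoothed = x.copy()
    half = window // 2
    for i in range(half, x.size - half):
        smoothed[i] = x[i - half : i + half + 1].mean()
    return smoothed

def moving_median(x, window=3):
    """Centered moving median; more robust to isolated outliers."""
    x = np.asarray(x, dtype=float)
    smoothed = x.copy()
    half = window // 2
    for i in range(half, x.size - half):
        smoothed[i] = np.median(x[i - half : i + half + 1])
    return smoothed

noisy = [1.0, 1.1, 9.0, 1.2, 1.3]   # 9.0 is an outlier
print(moving_average(noisy))  # the outlier leaks into its neighbours
print(moving_median(noisy))   # the outlier is suppressed
```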

Thirdly, the other assisting techniques or tools include checking the status of the sensors, checking the time elapsed since sensor maintenance, data context classification, the calibration of measuring systems, and the consideration of uncertainty [6, 7, 10].

Several research studies have been performed using data validation techniques. In [15], PCA is used for the compression of linearly correlated data. The authors compared the Auto-Associative Neural Network (AANN) and Kernel PCA (KPCA) methods for data validation and created a new approach, named Hybrid AANN-KPCA, that combines the two. When compared with the AANN and KPCA methods, the Hybrid AANN-KPCA achieves better performance in the prediction or correction of inconsistent data.
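
The core idea behind PCA-based validation can be sketched as follows (Python with NumPy; the training data and the number of retained components are hypothetical): a sample whose reconstruction error from the retained principal components is large breaks the correlations learned from healthy data and is flagged as inconsistent:

```python
import numpy as np

def fit_pca(train, n_components=2):
    """Fit PCA on training data (rows = samples, cols = sensors)."""
    mean = train.mean(axis=0)
    # SVD of the centered data gives the principal directions.
    _, _, vt = np.linalg.svd(train - mean, full_matrices=False)
    return mean, vt[:n_components]

def reconstruction_error(sample, mean, components):
    """Squared error between a sample and its PCA reconstruction."""
    centered = sample - mean
    projected = components.T @ (components @ centered)
    return float(np.sum((centered - projected) ** 2))

rng = np.random.default_rng(0)
# Hypothetical training set: 3 linearly correlated sensor channels.
t = rng.normal(size=(200, 1))
train = np.hstack([t, 2 * t, -t]) + 0.01 * rng.normal(size=(200, 3))
mean, comps = fit_pca(train, n_components=1)

ok = np.array([0.5, 1.0, -0.5])      # respects the correlation
bad = np.array([0.5, 1.0, 3.0])      # breaks the correlation
print(reconstruction_error(ok, mean, comps))   # small
print(reconstruction_error(bad, mean, comps))  # large
```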

In [16], the authors proposed that data validation may be performed with Kalman filtering and linear predictive coding (LPC), showing that the results using Kalman filtering are better than those of LPC for several types of data, although LPC reported a smaller energy consumption.
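
The sketch below (Python; all noise parameters and the gating threshold are hypothetical) shows a one-dimensional Kalman filter in the simplest validation role: each new reading is compared against the filter's prediction, and a reading falling outside a few standard deviations of the predicted value is flagged and excluded from the state update:

```python
import numpy as np

def kalman_validate(readings, q=1e-3, r=0.25, gate=3.0):
    """1-D constant-value Kalman filter with innovation gating.

    q: process noise variance, r: measurement noise variance,
    gate: innovation threshold in standard deviations (all hypothetical).
    Returns (filtered estimates, suspect flags).
    """
    x, p = readings[0], 1.0           # initial state and variance
    estimates, suspect = [], []
    for z in readings:
        p += q                        # predict: variance grows by q
        innov = z - x                 # reading minus prediction
        s = p + r                     # innovation variance
        flagged = abs(innov) > gate * np.sqrt(s)
        if not flagged:
            k = p / s                 # Kalman gain
            x += k * innov
            p *= (1 - k)
        estimates.append(x)
        suspect.append(flagged)
    return np.array(estimates), np.array(suspect)

readings = [1.0, 1.1, 0.9, 8.0, 1.0, 1.05]  # 8.0 is a spike
est, bad = kalman_validate(readings)
print(bad)   # the spike is flagged and excluded from the update
```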

Several studies proposed the use of ANNs, for example, the Multilayer Perceptron (MLP), which can be trained to identify faulty sensors using prototype data and used to determine the near-optimal subset of sensor data that produces the best results [2, 17–19]. Besides, sensor data validation may be performed with other probabilistic methods, such as Bayesian Networks, Propagation in Trees, Probabilistic Causal Methods, and Learning Algorithms [20]. The authors of [20] proposed anytime sensor validation algorithms that combine several probabilistic methods. In contrast, [21] proposed the validation of data using Sparse Bayesian Learning and the Relevance Vector Machine (RVM), which is a specialization of the SVM.

For the estimation of values during data validation, the authors of [22] analysed the use of the Kalman filter, which was implemented in two methods: Algorithmic Sensor Validation (ASV) and Heuristic Sensor Validation (HSV). The ASV method implements different statistical measures, for example, the mean, the standard deviation, and a sensor confidence that represents the uncertain nature of sensors. HSV identifies whether faulty sensor readings are attributable to a sensor or to a system failure. As another example, the authors of [23] proposed the use of the Kalman filter for the validation of GPS data.

Other methods in use are the grey models, which consist of differential equations describing the behaviour of an accumulated generating operation (AGO) data sequence. As an example, [4] presented a novel self-validating strategy using the grey bootstrap method (GBM) for data validation and dynamic uncertainty estimation of self-validating sensors. The GBM can evaluate the measurement uncertainty arising from poor information and small samples.

In [2], Autoregressive Moving Average (ARMA) models are used to determine the validity of the acquired data, evaluating the levels of noise and providing timely warnings of deviations from the expected signals. The model created for ARMA includes linear regression techniques to predict the invalid values with Autoregressive (AR) and Moving Average (MA) models. Sensor data validation in aeroengine vibration tests also implements the Autoregressive (AR) model, complemented with Empirical Mode Decomposition (EMD) [24]. Another presented method is sensor validation and fusion using the Nadaraya-Watson statistical estimator [25], combined with a Fuzzy Logic model [26]. These methods and others, including the use of Gaussian distributions and error detection methods, may also be used to improve the quality of the measurements [27, 28].
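
The sketch below (Python; the model order, training signal, and threshold are hypothetical) illustrates the AR part of this idea: an autoregressive model fitted by least squares predicts each sample from the previous ones, and a large prediction residual marks a suspect reading:

```python
import numpy as np

def fit_ar(x, order=2):
    """Least-squares fit of AR coefficients: x[t] ~ sum a_i * x[t-i]."""
    x = np.asarray(x, dtype=float)
    rows = [x[i : i + order][::-1] for i in range(x.size - order)]
    targets = x[order:]
    coeffs, *_ = np.linalg.lstsq(np.array(rows), targets, rcond=None)
    return coeffs

def ar_residuals(x, coeffs):
    """Prediction residual for each sample after the warm-up period."""
    x = np.asarray(x, dtype=float)
    order = coeffs.size
    preds = np.array([coeffs @ x[i - order : i][::-1]
                      for i in range(order, x.size)])
    return x[order:] - preds

# Hypothetical clean training signal: a pure oscillation.
t = np.arange(100)
clean = np.cos(0.3 * t)
coeffs = fit_ar(clean, order=2)

test = clean[:20].copy()
test[10] += 2.0                       # inject a spike
res = np.abs(ar_residuals(test, coeffs))
print(np.nonzero(res > 0.5)[0] + 2)   # sample indices with large residuals
```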

Intelligent sensor systems are able to perform the capture and validation of the sensors' data. Staroswiecki [29] argues that data validation is important to increase the confidence level of these systems, proposing two types of validation: technological and functional. Technological validation consists of the analysis of the conditions of the hardware resources of the sensors; it does not guarantee that the estimation produced by the sensor is correct, only that the operating conditions did not preclude its correctness. Functional validation, in turn, consists of Fault Detection and Isolation (FDI) procedures, which use algorithms to complement the technological validation. The authors of [30] also agreed with Staroswiecki on the separation of data validation into these two types, presenting a real-time algorithm based on probabilistic methods. Other studies have also researched and developed data validation techniques using intelligent sensor systems [31].

Another powerful technique for data validation consists of the use of self-validating (SEVA) sensors, which provide an estimation of the error bounds during the measurements [32]. SEVA sensors are widely researched in the literature. One example applies a Back-Propagation (BP) model to a system to obtain an estimated value and then uses a fault detection method called the sequential probability ratio test (SPRT) to assess the validity of the system [33]. For the use of SEVA technologies, the authors of [34] also proposed the validated random fuzzy variable (VRFV) based uncertainty evaluation strategy for the online estimation of the validated uncertainty (VU). In [35], the authors presented a novel strategy in which polynomial predictive filters coupled with the VRFV are proposed for the online validation of measurements and for the validated uncertainty estimation of multifunctional self-validating sensors. These authors also researched the use of fuzzy logic rules, comparing the predicted values with the actual measurements to obtain a confidence evaluation [36]. In [37], the authors proposed an approach to sensor data validation using self-reporting, including measurement based on data quality, that is, validating the data loss measured by periodic sensors, the timing of data collection, and the accuracy of the detection of changes. ANNs may also be used for SEVA with self-organizing maps (SOMs) [38], which are trained using unsupervised learning techniques to produce a low-dimensional, discretized representation of the input space of the training samples [39].
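
As a minimal sketch of the SPRT step mentioned above (Python; the means, variance, and error rates are hypothetical), the test accumulates a log-likelihood ratio over the residuals between the model estimate and the sensor reading and stops as soon as a fault or no-fault decision can be made:

```python
import math

def sprt(residuals, mu0=0.0, mu1=1.0, sigma=0.5, alpha=0.01, beta=0.01):
    """Sequential probability ratio test on Gaussian residuals.

    H0: residual mean is mu0 (healthy); H1: mean is mu1 (faulty).
    alpha/beta are the allowed false-alarm and missed-fault rates.
    All parameters here are illustrative.
    """
    upper = math.log((1 - beta) / alpha)   # accept H1 above this
    lower = math.log(beta / (1 - alpha))   # accept H0 below this
    llr = 0.0
    for n, r in enumerate(residuals, start=1):
        # Gaussian log-likelihood ratio of H1 vs H0 for one sample.
        llr += ((r - mu0) ** 2 - (r - mu1) ** 2) / (2 * sigma ** 2)
        if llr >= upper:
            return "faulty", n
        if llr <= lower:
            return "healthy", n
    return "undecided", len(residuals)

print(sprt([0.05, -0.1, 0.02, 0.08]))       # -> ('healthy', n)
print(sprt([0.9, 1.1, 1.0, 0.95, 1.05]))    # -> ('faulty', n)
```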

The use of valid data is important for the development of intelligent sensor systems, which may be used for health purposes and, consequently, for the detection of ADLs [40–45]. The use of mobile devices allows data acquisition anywhere and at any time, but these devices have several constraints, such as low memory, processing power, and battery resources; data validation may help increase the performance of the measurements while reducing the resources needed [46–48]. In general, these systems use probabilistic methods to detect failures in real time and obtain better results.

Table 1 presents a summary of the data validation methods included in each category. The methods that are most commonly implemented use statistical and artificial intelligence techniques, such as PCA, RVM, and ANNs, increasing the reliability of the data acquisition and data processing algorithms. Although the SVM and the ANN work in slightly different manners, their foundations are quite similar: an SVM without a kernel is, in fact, a single neural network neuron with a different cost function, and an SVM with a kernel is comparable to a 2-layer ANN.

Following the methods presented in Table 1, the most studied scenarios for data validation are mainly related to the health sciences, laboratory experiments, and other undifferentiated tasks. However, only a minority of the studies is related to the use of mobile devices, smart sensors, and other devices used daily. Besides, the development of healthcare solutions based on the sensors available on mobile devices increases the need to validate the data collected by these sensors. Depending on the type of data, for some complex data acquired, such as images, videos, GPS signals, and other complex data types, the validation should be accomplished by auxiliary systems working at the same time, validating the data at the server side, though a constant network connection must then be available. Other system topologies may be suitable for the implementation of data validation techniques. Wireless Sensor Networks (WSNs) are an example of systems where the different nodes of the network may perform the validation of the data collected by the neighbouring nodes, and these nodes may be composed of different types of sensors. However, the main topology for the implementation with mobile devices is self-validation using only the sensors and the data available on the mobile device.

Data validation may be executed automatically and transparently to the mobile device's user, and, commonly, at least one method for each stage should be implemented in a system to perform the validation of the sensors' data. Firstly, among faulty data detection methods, ANNs are the most used for training on the data and detecting inconsistent values. Secondly, among data correction methods, the most used is the Kalman filter. Thirdly, the assisting techniques that are most commonly applied are data context classification, checking the status of the sensors, and the consideration of uncertainty. By applying data validation techniques correctly, the reliability and acceptability of the systems may be increased.

3. Classification of Data Validation Methods

Data validation methods may be classified into three large groups [5]: faulty data detection methods, data correction methods, and other assisting techniques or tools.

The faulty data detection methods and the data correction methods may be executed sequentially in a multisensor system in order to obtain results based on valid data. The other assisting techniques or tools mainly consist of the validation of the working state of the sensors, and this validation may be executed at the same time as the faulty data detection and data correction methods, because these types of failures invalidate the results of the algorithms. These different approaches are based either on mathematical methods, for example, statistical or probabilistic methods, or on complex analysis, for example, artificial intelligence methods. According to [49], the data validation methods may be classified into the several types presented in Figure 2.

As depicted in Figure 2, the faulty data detection methods, used to detect failures in the sensors' signal, may include ANNs, dimensionality reduction methods, instance-based methods, probabilistic and statistical methods, and Bayesian methods. In turn, the data correction methods include filtering, regression, estimation, interpolation, smoothing, data mining, and data reconciliation. These methods work specifically with the sensors' data, and the selection of the methods that a system applies should consider the system's usage scenarios.

Finally, the other assisting techniques or tools are mainly related to the detection of problems originating either in the hardware components or in their working environment. In real-time systems, these problems should be checked for constantly to prevent failures in the captured data.

4. Applicability of the Sensor Data Validation Methods

Mobile devices have a plethora of sensors available for the measurement of several parameters, including for the identification of ADLs. Examples of these sensors include the accelerometer, the gyroscope, the magnetometer, the GPS receiver, and the microphone.

Data acquisition using accelerometers may fail because of several problems, including problems related to the internal electronic amplifier of the Integrated Electronic Piezoelectric (IEPE) device, exposure to temperatures beyond the accelerometer's working range, failures related to electrical components, the capture of environmental noise, the multitasking and multithreading capabilities of mobile devices, which may cause irregular sampling rates, the positioning of the accelerometer, the low processing and memory power, and the battery consumption [50]. The causes of failure of an accelerometer are similar to those of a gyroscope, a magnetometer, or a microphone [51]. In addition, the GPS has a further failure cause, which is poor satellite connectivity in indoor environments [52].

The validation of the data is important, but, for critical systems, for example, clinical systems, not only should the input data be validated, but the results should also be validated to guarantee the reliability, accuracy, and, consequently, acceptance of the system. The validation of the system may consist of the detection of failures, and the methods that may be applied are the faulty data detection methods. As presented in Section 3, this category includes probabilistic and statistical methods, among others, which may be used to validate the results of the system. This validation can be performed by comparing the results obtained by an equivalent system considered to be a gold standard [53] with the results obtained by the developed methods implemented on different sensors or devices, for example, a mobile device.

Once the initial error of the system is estimated, that is, how different the obtained results are from those of the gold standard system, the validation of the results of the system consists of three steps: the definition of the confidence level needed for the acceptance of the system, the determination of the minimum number of experiments needed to validate the application at the defined confidence level, and the validation of the results by comparison with a gold standard [54]. The definition of the degree of confidence of the system is a choice of the development team. The system design leader may, for instance, define that the system needs to have a maximum 5% error 95% of the time. Using these parameters, a minimum number of calibration experiments needs to be performed to allow the fine-tuning of the algorithm. The minimum number of experiments may be determined by several statistical tests, for example, Student's t-test [55].
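
As a hedged illustration of this step (Python with SciPy; the pilot error figures are hypothetical), the sketch below finds the minimum number of experiments as the smallest sample size whose t-based confidence interval for the mean error is narrower than the acceptable margin:

```python
import math
from scipy import stats

def min_experiments(pilot_std, max_error, confidence=0.95):
    """Smallest n whose t-based confidence interval for the mean
    error has half-width <= max_error.

    pilot_std is the error standard deviation observed in pilot
    runs against the gold standard (a hypothetical figure here).
    """
    n = 2
    while True:
        t_crit = stats.t.ppf(1 - (1 - confidence) / 2, df=n - 1)
        half_width = t_crit * pilot_std / math.sqrt(n)
        if half_width <= max_error:
            return n
        n += 1

# Hypothetical pilot figures: errors vs. the gold standard, in percent.
print(min_experiments(pilot_std=3.0, max_error=1.0, confidence=0.95))
# -> 38 experiments for a +/-1% margin at 95% confidence
```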

After the calibration of the algorithms in the system, further tests and comparisons with gold standard systems can be done to ensure that the results reported by the system have a 5% maximum error when compared to the gold standard results, 95% of the time. Note that the 5% and 95% values are merely indicative. Moreover, the data collection stage must take into consideration the limits for the optimal functioning of the sensors. As these limits are extremely dependent on the task the sensors must perform, we do not discuss them in this paper. For example, if the application is supposed to track the movements of a sportsperson in an open environment, it is plausible for a thermal sensor to report an environment temperature of −5°C; yet, for an application that tracks the indoor activity of an elder, such a value should raise an alarm. In this extreme case, it is even possible that more robust systems need to contain different types of sensors.

5. Conclusion

The validation of the data collected by sensors in a mobile device is an important issue for two main reasons: the first is the increasing number of devices and applications that make use of the devices' sensors; the other is that users increasingly rely on these devices and applications to collect information and make decisions that may be critical to their lives and well-being.

Despite the wide array of types of data validation algorithms, there is a lack of published information on the validity of many mobile applications. Moreover, it is impossible to present a critical comparison of the discussed methods, even within their respective categories, as their efficiency is extremely dependent on their particular usage; for example, the efficiency of a specific method may depend heavily on the number and type of features the algorithm selects from the signal to be processed, and these features are, of course, chosen in view of the intended purpose of the application. Additionally, it is possible that, even with the same chosen method and the same chosen set of features, different authors report different efficiency ratios, for example, because their base population samples vary in size and/or type, using different population sizes or populations that are homogeneous in age (elders or youngsters).

This paper has presented a discussion of the different types of data validation methods, such as faulty data detection, data correction, and assisting techniques or tools. Furthermore, a classification of these methods in accordance with their functionalities was provided. Finally, the relevance of the data validation methods for critical systems, in terms of their reliability, accuracy, and acceptance, was highlighted. Complementary studies should be conducted aiming to provide an overview of the use of valid data for the identification of ADLs.

Competing Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was supported by the FCT project UID/EEA/50008/2013. The authors would also like to acknowledge the contribution of the COST Action IC1303 Architectures, Algorithms and Protocols for Enhanced Living Environments (AAPELE).