Computational Intelligence and Neuroscience

Volume 2016 (2016), Article ID 9467878, 12 pages

http://dx.doi.org/10.1155/2016/9467878

## Improved Correction of Atmospheric Pressure Data Obtained by Smartphones through Machine Learning

^{1}Department of Computer Science and Engineering, Kwangwoon University, 20 Kwangwoon-ro, Nowon-gu, Seoul 01890, Republic of Korea

^{2}Department of Embedded Software Engineering, Kwangwoon University, 20 Kwangwoon-ro, Nowon-gu, Seoul 01890, Republic of Korea

^{3}Department of Computer Engineering, College of Information Technology, Gachon University, 1342 Seongnam-daero, Sujeong-gu, Seongnam-si, Gyeonggi-do 13120, Republic of Korea

^{4}Korea Oceanic and Atmospheric System Technology, No. 1503, STX W-Tower, 90, Gyeongin-ro 53-gil, Guro-gu, Seoul 08215, Republic of Korea

^{5}Observation Research Division, National Institute of Meteorological Sciences, 33 Seohobuk-ro, Seogwipo-si, Jeju-do 63568, Republic of Korea

^{6}Geography and Environment, University of Southampton, University Road, Southampton SO17 1BJ, UK

Received 6 November 2015; Accepted 9 June 2016

Academic Editor: Elio Masciari

Copyright © 2016 Yong-Hyuk Kim et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

A correction method using machine learning aims to improve the conventional linear regression (LR) based method for correction of atmospheric pressure data obtained by smartphones. The method proposed in this study conducts clustering and regression analysis with time domain classification. Data were obtained using smartphones from July 2014 through December 2014 in Gyeonggi-do, one of the most populous provinces in South Korea, which surrounds Seoul and has an area of 10,000 km^{2}. The data were classified with respect to time of day (daytime or nighttime), day of the week (weekday or weekend), and the user’s mobility, prior to the expectation-maximization (EM) clustering. Subsequently, the results were analyzed for comparison by applying machine learning methods such as multilayer perceptron (MLP) and support vector regression (SVR). The results showed a mean absolute error (MAE) 26% lower on average when regression analysis was performed through EM clustering compared to that obtained without EM clustering. Among the machine learning methods, the MAE for SVR was around 31% lower than that for LR and about 19% lower than that for MLP. It is concluded that pressure data from smartphones are as good as those from the national automatic weather station (AWS) network.

#### 1. Introduction

Severe weather, such as local torrential rains, gusts, or environmental disasters, has been occurring more frequently in recent years. Public warnings and alerts on the basis of near real-time observation are therefore increasingly important, especially for highly populated cities. Monitoring weather-related events requires large numbers of observations in the area of interest, but the spatial resolution of the conventional national-scale network of automatic weather stations (AWSs) is often insufficient. Although several studies [1–4] have examined the feasibility of portable meteorological equipment to enhance weather observation and forecasting, the use of such equipment is still limited by geographic constraints and financial reasons.

The advent of microelectromechanical systems (MEMS) sensors opened up new possibilities in the field of weather observation. Smartphones have been widely equipped with these devices, whose performance has also been improving quickly in response to user demand. The MEMS-based sensors in most smartphones that are potential candidates for meteorological observations are those measuring atmospheric pressure, temperature, and relative humidity. Thus, it is expected that smartphones may be used to obtain more specific meteorological data at a low cost, even if only for some basic weather variables. However, other studies [5, 6] have pointed out that issues of sensor performance and data reliability need to be resolved in order to utilize data obtained by sensors in smartphones.

In a previous study [7], we proposed a correction method that minimizes errors between the data obtained by smartphones and meteorological data of the Korea Meteorological Administration (KMA). The data were collected from MEMS meteorological sensors built into smartphones using an application called *Yeowoobi*, which can obtain such data from smartphones with Android OS 4.0 or greater and store them in a separate server. There have been several studies or guidelines published [8–10] on error correction of public meteorological equipment, a study [11] on the analysis of air temperature by using battery temperature measurements in smartphones, and a study [12] on observation of surface pressure, but our previous study was the first to correct atmospheric pressure data obtained by smartphones.

Our current study is intended to enhance the correction method used in our previous study by classifying the data previously obtained according to time, considering human mobility patterns, and using various machine learning methods. Data obtained and preprocessed in the same manner as in our previous study [7] were classified according to time (daytime or nighttime and weekday or weekend) based on user behavior patterns. They were automatically reclassified through clustering, and various machine learning methods such as linear regression (LR), multilayer perceptron (MLP), and support vector regression (SVR) were applied to them in order to analyze the results for comparison. Each machine learning method was established by identifying a parameter value leading to the optimal result, and the time required for determining this parameter value was also considered.

This paper is organized as follows. Section 2 introduces the machine learning methods used to improve the existing correction method in this study; Section 3 describes the meteorological data used in this study as well as the quality control (QC) preprocessing and the classification of data by time to compare the results with those from the previous study; Section 4 identifies the method that exhibits the best performance by analyzing the results of using various machine learning methods (i.e., clustering, LR, MLP, and SVR) based on data in the fields added; Section 5 analyzes the experimental results; and Section 6 presents considerations and directions for future work.

#### 2. Machine Learning

WEKA [13] is a machine learning program developed by the University of Waikato in New Zealand enabling the user to analyze data and to perform prediction modeling by using various machine learning algorithms. In this study, WEKA was used for data analysis by applying LR, MLP, SVR, and expectation-maximization (EM) clustering algorithms. These algorithms are described briefly in the subsections that follow.

##### 2.1. Linear Regression

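LR is a regression analysis method used for modeling a linear relationship between one or more independent variables and a dependent variable. It combines weights, whose initial values are provided, with data attributes to represent each instance in the form of a linear equation. The predicted value of the $i$th instance can be represented as

$$\hat{x}^{(i)} = w_0 + w_1 a_1^{(i)} + \cdots + w_k a_k^{(i)} = \sum_{j=0}^{k} w_j a_j^{(i)}, \quad a_0^{(i)} = 1, \tag{1}$$

where $a_j^{(i)}$ is the $j$th attribute of the $i$th instance and $k$ is the number of attributes. The weights ($w_j$s) are derived from the $n$ learning data. The difference between the predicted value and the actual value $x^{(i)}$ is calculated by (2), and the weights that minimize this difference are used to derive the LR equation:

$$\sum_{i=1}^{n} \left( x^{(i)} - \sum_{j=0}^{k} w_j a_j^{(i)} \right)^{2}. \tag{2}$$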
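The least-squares fit described above can be sketched in a few lines of NumPy. This is only an illustration, not the WEKA implementation used in the study: the function names and the toy data are hypothetical, and the weights are obtained with a generic least-squares solver.

```python
import numpy as np

def fit_linear_regression(A, x):
    """Return weights (bias first) that minimize the sum of squared errors."""
    # Prepend a constant attribute equal to 1 so the first weight acts as the bias.
    A1 = np.column_stack([np.ones(len(A)), A])
    w, *_ = np.linalg.lstsq(A1, x, rcond=None)
    return w

def predict_linear(A, w):
    """Predicted value of every instance: weighted sum of its attributes."""
    return np.column_stack([np.ones(len(A)), A]) @ w

# Hypothetical toy data generated exactly from x = 2 + 3*a1 - 1*a2.
A = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
x = 2.0 + 3.0 * A[:, 0] - 1.0 * A[:, 1]

w = fit_linear_regression(A, x)
sum_squared_error = np.sum((x - predict_linear(A, w)) ** 2)
```

Because the toy targets are an exact linear function of the attributes, the recovered weights match the generating coefficients and the squared error is numerically zero; real sensor data would leave a nonzero residual.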

##### 2.2. Multilayer Perceptron

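A multilayer neural network [14] is a nonlinear classification method based on the Perceptron, which is a linear classifier, but unlike the original Perceptron, it has a hidden layer between the input layer and the output layer. Learning in a multilayer neural network can be roughly divided into two stages. The first stage is a forward computation that calculates a predicted value from the input layer to the output layer, and the second stage is an error backpropagation that updates weights to minimize the error between the predicted value and the actual value. Given a multilayer neural network that has $p$ node(s) in one hidden layer, $d$ nodes in the input layer, and $m$ nodes in the output layer, a forward computation is performed using (3) to calculate from the input layer to the hidden layer and (4) to calculate from the hidden layer to the output layer:

$$z_j = \tau\left( \sum_{i=1}^{d} u_{ij} x_i + u_{0j} \right), \quad j = 1, \ldots, p, \tag{3}$$

$$o_k = \tau\left( \sum_{j=1}^{p} v_{jk} z_j + v_{0k} \right), \quad k = 1, \ldots, m, \tag{4}$$

where $x_i$ is the $i$th input, $z_j$ is the output of the $j$th hidden node, $o_k$ is the $k$th output, $u_{ij}$ and $v_{jk}$ are weights, and $\tau$ is an activation function. Typically, a sigmoid function as shown in (5) is the most widely used, and its gradient is determined according to the value of $\alpha$:

$$\tau(s) = \frac{1}{1 + e^{-\alpha s}}. \tag{5}$$

The error ($E$) between the value $o_k$ obtained through the forward computation and the actual value $t_k$ is defined as shown in

$$E = \frac{1}{2} \sum_{k=1}^{m} (t_k - o_k)^{2}. \tag{6}$$

The error backpropagation process, which updates weights $\mathbf{u}$ and $\mathbf{v}$ in order to reduce $E$, is repeated for each generation $h$ through

$$\mathbf{v}(h+1) = \mathbf{v}(h) - \rho \frac{\partial E}{\partial \mathbf{v}}, \quad \mathbf{u}(h+1) = \mathbf{u}(h) - \rho \frac{\partial E}{\partial \mathbf{u}}, \tag{7}$$

where $\rho$ is the learning rate. The MLP equation is derived by using the optimal weights obtained through the process above.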
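The two learning stages, forward computation and error backpropagation, can be illustrated with a small NumPy sketch. It is not the WEKA MLP used in the study. It assumes one hidden layer, a sigmoid with its slope parameter fixed to 1, a squared-error objective, and full-batch gradient descent; the names `U`, `V`, and `rho`, the network size, and the toy data are all hypothetical.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def forward(X, U, V):
    """Forward computation: input -> hidden -> output. Biases are handled
    by prepending a constant 1 to the inputs and to the hidden outputs."""
    Xb = np.column_stack([np.ones(len(X)), X])
    Z = sigmoid(Xb @ U)
    Zb = np.column_stack([np.ones(len(Z)), Z])
    O = sigmoid(Zb @ V)
    return Xb, Zb, O

def squared_error(T, O):
    return 0.5 * np.sum((T - O) ** 2)

def backprop_step(X, T, U, V, rho=0.5):
    """One gradient-descent update of both weight matrices."""
    Xb, Zb, O = forward(X, U, V)
    Z = Zb[:, 1:]
    delta_out = (O - T) * O * (1.0 - O)                 # error signal at output nodes
    delta_hid = (delta_out @ V[1:].T) * Z * (1.0 - Z)   # propagated back to hidden nodes
    return U - rho * Xb.T @ delta_hid, V - rho * Zb.T @ delta_out

# Hypothetical toy problem: learn the logical OR of two inputs.
rng = np.random.default_rng(0)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
T = np.array([[0.0], [1.0], [1.0], [1.0]])
U = rng.normal(scale=0.5, size=(3, 3))   # (inputs + bias) x hidden nodes
V = rng.normal(scale=0.5, size=(4, 1))   # (hidden nodes + bias) x outputs

error_before = squared_error(T, forward(X, U, V)[2])
for _ in range(2000):
    U, V = backprop_step(X, T, U, V)
error_after = squared_error(T, forward(X, U, V)[2])
```

Comparing `error_before` with `error_after` shows the effect of repeating the weight update over many generations.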

##### 2.3. Support Vector Regression

SVR is a support vector machine (SVM) algorithm that is used to solve regression problems and that can also be applied in nonlinear prediction. In contrast to other existing algorithms, including neural networks, it optimizes generalization performance by maximizing the margin, that is, the space between two classes.

Instances that are the most adjacent to the hyperplane that has the maximum space, that is, instances located the shortest distance from that plane, are called support vectors. A unique set of these support vectors determines the hyperplane that has the maximum space for a learning problem; the other instances are irrelevant to learning. The SVR equation is shown as

$$f(\mathbf{a}) = b + \sum_{i \in \mathrm{SV}} \alpha_i y_i K(\mathbf{a}(i), \mathbf{a}). \tag{8}$$

All the results calculated by using a kernel function $K$ and a test sample $\mathbf{a}$ for every $i$ having support vector(s) $\mathbf{a}(i)$ are added together. $\alpha_i$ is a Lagrange multiplier, $y_i$ is an integer that represents the category, and $b$ is a constant that represents the location of the hyperplane.
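The prediction step described above, summing kernel evaluations between a test sample and each support vector and adding a constant, can be sketched as follows. This is a hedged illustration rather than the SMOreg implementation: the function names, the kernel choices, and all numeric values are hypothetical, and the multipliers are supplied by hand instead of being learned.

```python
import numpy as np

def linear_kernel(u, v):
    return float(np.dot(u, v))

def rbf_kernel(u, v, gamma=0.5):
    diff = np.asarray(u) - np.asarray(v)
    return float(np.exp(-gamma * np.dot(diff, diff)))

def support_vector_predict(a, support_vectors, alphas, ys, b, kernel):
    """Constant b plus the kernel value between each support vector and the
    test sample a, weighted by the multiplier and the category sign."""
    total = b
    for sv, alpha, y in zip(support_vectors, alphas, ys):
        total += alpha * y * kernel(sv, a)
    return total

# Hypothetical support vectors and parameters (not taken from the study).
support_vectors = np.array([[1.0, 0.0], [0.0, 1.0]])
alphas = [0.5, 0.25]
ys = [1, -1]
b = 1.0

test_sample = np.array([2.0, 4.0])
prediction = support_vector_predict(
    test_sample, support_vectors, alphas, ys, b, linear_kernel
)
```

Only the support vectors enter the sum, which is why the remaining training instances do not affect the prediction. Swapping `linear_kernel` for `rbf_kernel` gives a nonlinear predictor with the same code path.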

In addition, the SMOreg algorithm is a kind of SVR based on the sequential minimal optimization (SMO) algorithm, an optimization algorithm proposed for training SVR [15]. The SMO algorithm is inefficient because it maintains only one threshold; the SMOreg algorithm solves this problem by using two thresholds [16].

##### 2.4. Expectation-Maximization Clustering

EM clustering is an iterative algorithm that first estimates initial values for unobservable parameters and then calculates the cluster probability of each instance by using the initial values to find the parameter value having the maximum likelihood [17].

First, after initial values for the parameters in each cluster have been assigned, the probability for each instance to be included in clusters is calculated. Then, parameters that have the maximum likelihood are recalculated by using the instance points included in each cluster. This process is performed repeatedly until the parameter values for each cluster do not change.
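The iteration described above can be sketched for a one-dimensional Gaussian mixture. This is a simplified illustration, not the WEKA EM clusterer used in the study: the Gaussian cluster model, the quantile-based initialization, the stopping tolerance, and the synthetic pressure-like data are all assumptions made for the example.

```python
import numpy as np

def em_gmm_1d(x, k=2, max_iter=200, tol=1e-8):
    """EM clustering of 1-D data with k Gaussian clusters."""
    x = np.asarray(x, dtype=float)
    # Initial parameter values: means spread over data quantiles.
    means = np.quantile(x, (np.arange(k) + 0.5) / k)
    variances = np.full(k, np.var(x))
    weights = np.full(k, 1.0 / k)
    for _ in range(max_iter):
        # E-step: probability of each instance belonging to each cluster.
        dens = (weights
                * np.exp(-0.5 * (x[:, None] - means) ** 2 / variances)
                / np.sqrt(2.0 * np.pi * variances))
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: maximum-likelihood parameters given those probabilities.
        nk = resp.sum(axis=0)
        new_means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - new_means) ** 2).sum(axis=0) / nk + 1e-9
        weights = nk / len(x)
        converged = np.max(np.abs(new_means - means)) < tol
        means = new_means
        if converged:   # stop once the cluster means no longer change
            break
    return means, variances, weights, resp

# Synthetic data: two groups of pressure-like values in hPa (hypothetical).
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(1000.0, 1.0, 300), rng.normal(1010.0, 1.0, 300)])
means, variances, weights, resp = em_gmm_1d(x)
labels = resp.argmax(axis=1)   # most probable cluster for each instance
```

On well-separated groups like these, the estimated means settle close to the two generating centers, and each instance can then be routed to a per-cluster regression model.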

#### 3. Experimental Data

##### 3.1. Smartphone Data

Meteorological data for South Korea for dates between July 1 and December 31, 2014, were obtained using a smartphone application called *Yeowoobi* (the term *Yeowoobi* means sunshower in Korean). The meteorological data collected by this application include the time at which data are received at a server, transmission methods, location precision, time information (i.e., year, month, day, hour, minute, and second), latitude (degrees), longitude (degrees), spot atmospheric pressure (hPa), user identification number, temperature (°C), relative humidity (%), and smartphone information. The initial cycle for obtaining data for atmospheric pressure, temperature, and relative humidity is 10 minutes. Users of the *Yeowoobi* app can select one of nine stages for the observation cycle (from one minute to three hours) based on various factors such as battery consumption and cost of the data transfer.

In this study, the subset of meteorological data obtained in Gyeonggi-do (including Seoul) was used as the experimental data. When latitude and longitude are calculated to the third decimal point, the number of data obtained is approximately two million (47% of the entire data set collected), and the number of users is approximately two thousand (63% of the entire set of users) (Table 1). Figure 1 shows a map of the observation data across South Korea obtained by smartphones, showing location information for the observation data in Gyeonggi-do. In addition, among 692 points of public meteorological equipment throughout the country, 238 points are located in Gyeonggi-do; 53 of these 238 sites correspond to locations where data were also collected by smartphones.
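One way to identify sites where both an AWS and smartphone observations are available is to compare coordinates rounded to the third decimal place. The sketch below assumes that matching rule, which the text does not spell out, and uses hypothetical station names and coordinates.

```python
def grid_key(lat, lon, decimals=3):
    """Round a coordinate pair to a fixed number of decimal places."""
    return (round(lat, decimals), round(lon, decimals))

def co_located_sites(stations, phone_points, decimals=3):
    """Return names of stations whose rounded coordinates match any smartphone point."""
    phone_cells = {grid_key(lat, lon, decimals) for lat, lon in phone_points}
    return [name for name, lat, lon in stations
            if grid_key(lat, lon, decimals) in phone_cells]

# Hypothetical stations (name, latitude, longitude) and smartphone points.
stations = [("AWS-A", 37.61912, 127.05849), ("AWS-B", 37.27450, 127.00910)]
phone_points = [(37.61934, 127.05821), (37.50012, 126.90077)]

matched = co_located_sites(stations, phone_points)
```

At this latitude, three decimal places correspond to a cell roughly 100 m on a side, so the rule treats a station and a phone as co-located only when they are very close.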