Abstract
In atmospheric science, the volume of meteorological data is massive and growing rapidly. K-means is a fast and widely available clustering algorithm that has been used in many fields. However, for large-scale meteorological data, the traditional K-means algorithm cannot efficiently satisfy the needs of practical applications. This paper proposes an improved K-means algorithm (MK-means) based on MapReduce, designed around the characteristics of large meteorological datasets. The experimental results show that MK-means offers better computing performance and scalability.
1. Introduction
In the atmospheric sciences, meteorological data are extremely rich and valuable; they require large amounts of scientific computing and support services to the community. With the continued expansion of meteorological services and the modernization of meteorology, a large amount of meteorological information has been accumulated through service, research, and management activities. High-performance computers are required to process these data, but small organizations cannot afford their high price. Cloud computing provides meteorological organizations with cheap computing services that offer higher efficiency, lower cost, and lower carbon emissions. Climate data are increasing dramatically in volume and complexity, as the number of users of these data in the scientific community and the public is growing rapidly [1]. Faced with such large-scale meteorological data, highly efficient computing power (more than a trillion operations per second) is urgently required. Therefore, establishing a cloud computing system for weather information processing is of great importance.
MapReduce is a key technology for processing large amounts of data with cloud computing. Proposed by Google, it is a parallel programming model and an associated implementation for processing and generating large datasets across a broad variety of real-world tasks. It is not only a programming model but also a task scheduling model. It is composed of two fundamental user-defined functions: Map and Reduce. A Map function processes a key/value pair to produce a set of intermediate key/value pairs. A Reduce function merges all intermediate values associated with the same intermediate key [2]. MapReduce is typically used to perform distributed computing on clusters of computers. Google’s MapReduce abstracts distributed computing away from its complex details, so that programmers can use the resources of a large distributed system without any experience with parallel or distributed systems. As a result, performance that could originally be achieved only by expensive high-performance computers can be achieved by low-cost computing services.
Not all data mining algorithms can currently be parallelized to handle large datasets. Some algorithms cannot be parallelized even in theory, and some need to be adapted to take advantage of the efficiency of parallelization. In this paper, we use the K-means algorithm in the MapReduce framework. Specifically, we adapt the K-means algorithm to an open-source software framework, Hadoop, and apply the parallel K-means algorithm (MK-means) to cluster large-scale weather data.
This paper is organized as follows. Related work is reviewed in Section 2. In Section 3, we introduce the MapReduce programming model. In Section 4, we describe our parallel K-means algorithm (MK-means) for large-scale meteorological data using MapReduce. In Section 5, we evaluate the MK-means algorithm experimentally by applying it to cluster large-scale meteorological data. Finally, we conclude the paper in Section 6.
2. Related Work
In recent years, there has been significant research on K-means clustering and MapReduce. The K-means clustering problem has been well studied in data mining and related fields, and K-means is one of the top 10 algorithms in data mining [3]. Its simplicity and speed allow it to run on large datasets. With the development of information technology, the volume of information keeps growing. MapReduce is a relatively novel programming model for solving certain kinds of distributable problems and processing large datasets [2]. To deal with high-dimensional and large datasets, researchers have proposed a number of methods [4–8]. Böse et al. [9] implemented several incremental data mining algorithms, including Naïve Bayes and PCA, and applied them to large-scale datasets. Chu et al. [10] implemented several algorithms on MapReduce, such as SVM, ICA, PCA, Gaussian discriminant analysis, EM, and backpropagation. Chao et al. [6] proposed a parallel Comeans algorithm based on MapReduce, which distributes the clustering load over a given number of processors. Reference [8] adapts an ensemble learning method, bagging, to overcome instability and sensitivity to outliers when clustering large datasets. There has also been work on developing algorithms and approximation algorithms that fit the MapReduce model [11]. Apache Hadoop [2] is a free Java MapReduce framework that allows the parallel or distributed processing of large datasets. Zhao et al. [4] presented a fast parallel K-means clustering algorithm based on the MapReduce framework; however, their approach does not consider the characteristics of large meteorological datasets and cannot achieve good results on them. Reference [12] demonstrated the utility of the K-means clustering algorithm for identifying relationships between winds at turbine heights and climate oscillations, thereby developing a method for predicting the impacts of climate change on wind resources.
However, only a few studies on processing large-scale meteorological data with MapReduce have been reported.
In this paper, we present a parallel clustering algorithm, MK-means, which is based on both K-means and MapReduce and is intended for very large meteorological datasets.
3. MapReduce Overview
As mentioned above, MapReduce was developed by Google. MapReduce libraries have been written in many programming languages, such as Java, Python, and C++ [13–16]. It is mainly used to process large-scale (TB-level) data files. MapReduce is not only a simplified programming model but also an efficient distributed scheduling model. Programming in such a cloud computing environment is very simple: the platform handles the management of the cluster, including reliability and scalability [17], so application developers only need to focus on the application itself. “Map” and “Reduce” are the two basic computing units of the MapReduce model. The Map program cuts massive data into independent blocks, which are scheduled onto many computers for processing, achieving distributed computing. The Reduce program then summarizes and outputs the results from these computers.
In MapReduce, massive data is processed in parallel. Data is initially partitioned across the nodes (computers) of a cluster and stored in a distributed file system (DFS). Data is represented as (key, value) pairs. The computation of the two functions is expressed formally as follows [5]: $\mathrm{Map}: (k_1, v_1) \rightarrow \mathrm{list}(k_2, v_2)$, $\mathrm{Reduce}: (k_2, \mathrm{list}(v_2)) \rightarrow \mathrm{list}(v_3)$.
Google’s MapReduce programming model is shown in Figure 1.
To further illustrate the MapReduce programming model, the pseudocode of a MapReduce program is shown in Algorithm 1. The program calculates the annual maximum temperature [18].

A Map function extracts all the years and temperatures (key/value pairs) that appear in the text, and these pairs are sent to an intermediate temporary space specified by MapReduce. Through intermediate processing by the MapReduce framework, the key/value pairs are grouped by key, so that each year is followed by a list of temperatures. A Reduce function then only needs to find the maximum value in each list. The result is the annual maximum temperature.
Figure 2 shows the intermediate results of each step of a MapReduce execution, including the Map and Reduce phases, both of which use all nodes in the cluster. Between the Map and Reduce phases there is an intermediate phase, which concatenates the intermediate results with the same key into a list. The Reduce function uses this list to output the maximum temperature of a given year.
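The three phases described above (Map, grouping by key, Reduce) can be sketched on a single machine as follows. This is only an illustrative sketch, not Algorithm 1 itself: the record format "year,temperature" and the function names map_temperature, group_by_key, reduce_max, and annual_max are assumptions made for the example.

```python
from collections import defaultdict


def map_temperature(line):
    """Map: turn one 'year,temperature' record (assumed format) into a (year, temperature) pair."""
    year, temperature = line.split(",")
    return year.strip(), float(temperature)


def group_by_key(pairs):
    """Intermediate phase: collect all values that share a key into one list."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_max(year, temperatures):
    """Reduce: keep only the maximum temperature observed in the given year."""
    return year, max(temperatures)


def annual_max(lines):
    """Run the three phases in sequence on an in-memory list of records."""
    pairs = [map_temperature(line) for line in lines]
    grouped = group_by_key(pairs)
    return dict(reduce_max(year, temps) for year, temps in grouped.items())
```

In a real Hadoop job the grouping step is performed by the framework between the Map and Reduce phases; here it is written out explicitly only to make the data flow visible.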
4. MK-means Clustering Algorithm
K-means is a partition-based clustering algorithm that is widely used in cluster analysis. It clusters well on data with a spherical, convex distribution, but for massive datasets the calculation of distances between objects becomes an efficiency bottleneck. It is only guaranteed to converge to a local optimum, and its clustering results are very sensitive to the choice of initial centroids. Most importantly, it is not efficient for processing massive data. In this section, we present how to adapt K-means to a parallel environment for big data.
Let us briefly review the K-means algorithm. Here is the formal description of K-means.
Given a set of $n$ data points $x_1, \ldots, x_n$ in $d$-dimensional space and a specified number of clusters $k$, the goal of K-means is to find the centroids $c_1, \ldots, c_k$ that minimize $\sum_{j=1}^{n} \min_{1 \le i \le k} \lVert x_j - c_i \rVert^2$, where $\lVert x_j - c_i \rVert$ is the Euclidean distance between $x_j$ and $c_i$. To find such centroids, the K-means algorithm first randomly selects $k$ central points in the $d$-dimensional space. It then calculates the distance from each data point to the centroids and assigns the data point to the closest centroid. After all data points have been assigned to their closest centroids, the initial clusters are formed. For each cluster, K-means readjusts its centroid by computing the mean of each dimension of the data points in the cluster. Thus, the centroids are updated. With the updated centroids, K-means reassigns all the data points to the centroids again. This process repeats until the assignments of the data points no longer change.
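The assign-and-update loop described above can be sketched in a few lines. This is a minimal single-machine sketch, not the algorithm proposed in this paper; the names nearest and kmeans are hypothetical, and the initial centroids are passed in as an argument (rather than drawn at random) so that the example is reproducible.

```python
import math


def nearest(point, centroids):
    """Return the index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def kmeans(points, initial_centroids, max_iter=100):
    """Repeat assignment and centroid update until the assignments stop changing."""
    centroids = [tuple(c) for c in initial_centroids]  # copy, leave the caller's list untouched
    assignment = None
    for _ in range(max_iter):
        new_assignment = [nearest(p, centroids) for p in points]
        if new_assignment == assignment:
            break  # no point changed cluster: converged
        assignment = new_assignment
        for i in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignment
```

The cost of the assignment step grows with the product of the number of points and the number of centroids, which is the part that the MapReduce adaptation distributes over the nodes.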
First, $k$ objects chosen from the $n$ data objects serve as the initial cluster centers. In each subsequent cycle, the central points are not updated by recomputing the distances between each central point and all remaining points; instead, each central point is recomputed from the current clustering result by taking the mean of the points in its cluster, which yields the central points for the next cycle. The clustering process of K-means is shown in Figure 3. From the figure, we can clearly see that the selected cluster centers are processed iteratively until they reach a stable final state; the clustering result is then finalized, as shown by the red circles. In the parallel setting, the classes obtained by clustering are assigned to the computer nodes, the central points are calculated by these nodes, and the central points are returned. The distances between all data points and each central point are then calculated to obtain the clustering result. The K-means algorithm commonly uses the Euclidean distance as the standard measure of similarity.
Figure 3: The clustering process of K-means, panels (a) to (f).
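The K-means algorithm tries to find an optimal solution by minimizing the square error: $E = \sum_{i=1}^{k} \sum_{x_j \in C_i} \lVert x_j - c_i \rVert^2$, where $k$ is the number of clusters, $n$ is the total number of data objects $x_1, \ldots, x_n$, $C_i$ is the $i$th cluster, $c_i$ is the center of the $i$th cluster, and $\lVert x_j - c_i \rVert$ is the Euclidean distance between the sample $x_j$ and the center of the $i$th cluster.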
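As an illustration of the square error criterion, the following sketch sums the squared Euclidean distance from every sample to the center of the cluster it is assigned to. The function name squared_error and the representation of the clustering as a list of cluster indices are assumptions made for this example.

```python
import math


def squared_error(points, assignment, centers):
    """Sum of squared Euclidean distances from each point to its assigned cluster center.

    assignment[j] is the index (into centers) of the cluster that points[j] belongs to.
    """
    return sum(math.dist(p, centers[a]) ** 2 for p, a in zip(points, assignment))
```

For four points that each lie at distance 1 from their cluster center, the function returns a value of 4, one unit of squared error per point.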
Definition 1. The definition of a set of the centroid points is . Let is a set point of the dimensional space, , if meets ; then it is claimed that be the ε approximate centroid point set of , where is the optimal value of the cluster results of means.
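Definition 2. Let $p_1, \ldots, p_m$ be points in $d$-dimensional space. If there exist real numbers $\lambda_1, \ldots, \lambda_m$ such that $\lambda_i \ge 0$ and $\sum_{i=1}^{m} \lambda_i = 1$, and a point $p$ in the $d$-dimensional space satisfies $p = \sum_{i=1}^{m} \lambda_i p_i$, then $p$ is a convex combination of the points $p_1, \ldots, p_m$.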
Theorem 3. To the fixedpoint set , for any a point ,
To take advantage of high-performance parallel computing in meteorology, we propose a fast K-means algorithm (MK-means) for weather information processing using the MapReduce model. The parallel workflow of the MK-means algorithm is shown in Figure 4.
Figure 4 represents the running process of the parallel K-means algorithm on a MapReduce execution. The MapReduce process first splits the data into segments [16]. Then each Map task generates a sequence of key/value pairs from its segment, which are stored in HDFS files. For each Map task, the parallel K-means algorithm constructs a global variable, centers, holding the centers of the clusters. Next, the library runs many copies of the program on the Hadoop cluster. The intermediate pairs are then stored in memory, shuffled, and sorted. Finally, the Reduce task sums all samples and counts the total number of samples assigned to the same cluster, from which we obtain the new centers used for the next iteration.
The MK-means clustering algorithm for meteorological data proposed in this paper is shown in Algorithm 2.

Map Function. The Map function first constructs a global variable list, centers. It then scans the sequence file of key/value pairs as input and reads each line as a data sample. Next, it calculates the distance from the data sample to each centroid stored in centers and chooses the minimum distance. The data sample is assigned to the corresponding cluster center, and a value is set for the data sample. The pseudocode of the Map function is shown in Algorithm 3.
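The Map step described above can be sketched as follows. This is a hedged illustration rather than Algorithm 3: the function name kmeans_map and the emitted pair layout (cluster index as the key, the sample together with a count of 1 as the value) are assumptions chosen so that later steps can sum samples and counts.

```python
import math


def kmeans_map(sample, centers):
    """Assign one data sample to its nearest center.

    Returns (cluster_index, (sample, 1)); the trailing 1 is a count that
    later stages can add up (an assumed output layout, not the paper's).
    """
    cluster_index = min(range(len(centers)), key=lambda i: math.dist(sample, centers[i]))
    return cluster_index, (tuple(sample), 1)
```

Because each sample is handled independently of all others, many copies of this function can run on different data segments at the same time, provided every copy sees the same list of centers.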

For the meteorological data, each worker (Worker 1, Worker 2, and Worker 3) runs the Map function and emits its map output as a list of key/value pairs.
Combine Function. Each map task produces a large amount of data. To reduce the communication burden among different nodes, the combine function sums the values of the points assigned to the same cluster, computes their mean, and passes this mean value to the Reduce function for further processing. The pseudocode for the combine function is shown in Algorithm 4.
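A local combine step of this kind might look like the sketch below. It is not Algorithm 4: the function name kmeans_combine, the input layout (cluster index paired with a sample and a count), and the choice to emit the local mean together with the number of samples behind it are all assumptions for illustration.

```python
def kmeans_combine(mapped_pairs):
    """Summarize the output of one map task per cluster.

    mapped_pairs: iterable of (cluster_index, (sample, count)).
    Returns {cluster_index: (local_mean, local_count)}.
    """
    sums = {}
    counts = {}
    for cluster_index, (sample, count) in mapped_pairs:
        if cluster_index not in sums:
            sums[cluster_index] = [0.0] * len(sample)
            counts[cluster_index] = 0
        sums[cluster_index] = [s + x for s, x in zip(sums[cluster_index], sample)]
        counts[cluster_index] += count
    return {
        c: (tuple(s / counts[c] for s in sums[c]), counts[c])
        for c in sums
    }
```

Only one (mean, count) pair per cluster leaves each node, instead of one pair per sample, which is what reduces the traffic between nodes.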

Reduce Function. The Reduce function first obtains the local mean values produced by the combine function on each node and then combines these local means into a global mean. The input values for Reduce are grouped automatically from the intermediate results. By setting a counter, count, in the combine function, the Reduce function and the combine function can easily exchange information. The counter also records the number of data samples that contributed to each mean value. The pseudocode for the Reduce function is shown in Algorithm 5.
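Combining local means into a global mean requires weighting each local mean by the number of samples behind it, which is why the counter matters. The sketch below illustrates this under assumed names and data layout (kmeans_reduce, a list of (local_mean, count) pairs per cluster); it is not Algorithm 5.

```python
def kmeans_reduce(cluster_index, partials):
    """Merge per-node results for one cluster into its new global center.

    partials: non-empty list of (local_mean, count) pairs, one per node.
    Returns (cluster_index, global_mean), where global_mean is the
    count-weighted average of the local means.
    """
    total = sum(count for _, count in partials)
    dimensions = len(partials[0][0])
    global_mean = tuple(
        sum(mean[d] * count for mean, count in partials) / total
        for d in range(dimensions)
    )
    return cluster_index, global_mean
```

An unweighted average of the local means would give the wrong center whenever the nodes hold different numbers of samples for the same cluster.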

5. Experimental Classification Results and Analysis
The MK-means algorithm is deployed in the meteorological information data center at the Nanjing University of Information Science & Technology to analyze meteorological information. The meteorological data are described in the following subsection.
5.1. Meteorological Dataset
All experiments are conducted on a set of meteorological datasets (http://www.atmosphere.csdb.cn/page/showEntity.vpage?uri=data_ziliao.haiyangziliao). The test data consist of four years (2004 to 2007) of all-element mapping data from national reference climatological stations, obtained from the China Meteorological Data Sharing Service System (http://cdc.cma.gov.cn/). The data include observations at 2:00, 8:00, 14:00, and 20:00 every day from 753 national reference climatological stations across the whole country. The observed elements are total cloud amount, wind direction, wind speed, sea level pressure (or station air pressure), 3-hour pressure change, past weather 1, past weather 2, 6-hour rainfall, low cloud form, low cloud cover, low cloud height, dew point, visibility, present weather, temperature, cloud form, high cloud, and other weather elements. The default value of an element is 9999.
The data used in our experiments have 26 attributes: district station number (long integer); longitude, latitude, and altitude (all floating-point); site level (integer); total cloud cover, wind direction, wind speed, sea level pressure (or station pressure), 3-hour pressure change, past weather 1, past weather 2, 6-hour rainfall, low cloud form, low cloud cover, low cloud height, dew point, visibility, present weather, temperature, cloud form, high cloud, flag 1, and flag 2 (all integers); and 24-hour temperature change and 24-hour pressure change.
The datasets are stored as files in HDFS on Hadoop. We formed four datasets: Dataset 1 is the meteorological data of 2007, Dataset 2 is the data from 2006 to 2007, Dataset 3 is the data from 2005 to 2007, and Dataset 4 is the data from 2004 to 2007. The properties of the datasets are shown in Table 1. The selected datasets share the same characteristic: their class attribute is numerical.
5.2. Experiment Platform
The experiment is conducted on nine PCs running the CentOS 5.4 operating system (Red Hat Enterprise Linux 4.1.2). Each PC has the required software installed, namely JDK 1.6.0, Hadoop 0.19.2, and Mahout 0.3. The distributed cloud environment is based on Hadoop. From the storage point of view, its nodes are divided into a NameNode and DataNodes (only one NameNode and multiple DataNodes). From the MapReduce point of view, nodes are divided into a JobTracker and TaskTrackers (only one JobTracker and multiple TaskTrackers). The JobTracker and the NameNode can be deployed on the same machine. The machine on which the NameNode and/or JobTracker is deployed is the master; the rest are slaves.
In the experiment, nine PCs are used to build the cloud computing environment. Each PC has a 2.66 GHz Intel Core CPU and 2 GB of RAM. The nine computers are connected through a 100 Mbps LAN switch. Each PC has a label: “Aiken” serves as the NameNode and JobTracker, and sev136, sev138, sev144, sev145, sev148, sev149, sev154, and sev155 serve as DataNodes and TaskTrackers. The /etc/hosts file of each machine is configured accordingly. The IP address of the NameNode and JobTracker is configured in conf/hadoop-site.xml. After the Hadoop cluster is built successfully, the information of each node and of the MapReduce tasks is shown in Figure 5.
The server “Aiken” can log in to each machine without a password, using keys generated with ssh-keygen. The key configuration items of Hadoop in the experimental environment are shown in Table 2. Each machine modifies conf/masters. Again, the IP address of the NameNode and JobTracker is configured in conf/hadoop-site.xml. The relevant parameters are modified in conf/hadoop-default.xml and conf/hadoop-site.xml.
During the experiment, we found that one factor, the block size, affects performance significantly. If the block size is set too small, the number of tasks that must be coordinated increases, and the added overhead reduces performance. If it is set too large, the job cannot fully benefit from parallel processing. The block size should therefore be chosen according to the actual amount of data to be processed.
5.3. Experimental Results
To evaluate the performance of our proposed MK-means algorithm on meteorological datasets, we use the running time, speedup, and scaleup [19]. Speedup describes the performance of a parallel algorithm in terms of the reduction in running time, which is an important indicator for verifying a parallel algorithm. Speedup is defined as $S_p = T_1 / T_p$, where $T_1$ is the time it takes to solve the problem on a single machine, and $T_p$ is the time spent by the parallel algorithm on $p$ nodes of the same kind to solve the same problem. If the speedup of the MK-means algorithm maintains linear growth as $p$ increases, then providing more nodes shortens the time spent.
We first conduct the experiment on the four datasets described in Table 1. The intermediate results of each iteration are stored in the “clusters” folder, where “” is the number of clusters. The final clustering results are stored in the folder of points. The results are shown in Table 3.
The experimental results show that the MK-means algorithm is suitable for practical use. The algorithm runs stably and reliably, and it satisfies the requirements of large-scale data processing in a real parallel and distributed environment.
We further investigated the performance of the MK-means algorithm with different numbers of Hadoop nodes and with datasets of different sizes. The corresponding running time is shown in Figure 6.
Figure 6 shows that the running time of the MK-means algorithm decreases as the number of nodes increases. The larger the dataset, the more significantly the running time decreases as computer nodes are added. We also evaluate the performance of the MK-means algorithm in terms of speedup and scaleup, shown in Figure 7. As described before, speedup measures the performance of the MK-means algorithm. In addition, we measure the scalability (scaleup) of our algorithm. Scaleup is evaluated by increasing the number of nodes and the amount of data by the same factor at the same time. Scaleup is defined as $\mathrm{Scaleup} = T_{1} / T_{m}$, where $T_{1}$ is the time to process a dataset on one node and $T_{m}$ is the time to process an $m$-times larger dataset on $m$ nodes.
Figure 7: (a) Speedup; (b) Scaleup.
If the value of scaleup stays in the vicinity of 1, or less, as the number of nodes changes, the algorithm adapts very well to the size of the dataset. The result of scaleup is shown in Figure 7(b).
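Both metrics are simple ratios of measured running times, as the following sketch shows. The function names are hypothetical, and the scaleup ratio follows the common definition (time for the base dataset on one node divided by the time for an m-times larger dataset on m nodes), which is assumed here. The timings used in the usage note are invented and are not results from this paper.

```python
def speedup(time_single_node, time_parallel):
    """Running time on one node divided by running time on several nodes, same dataset."""
    return time_single_node / time_parallel


def scaleup(time_base, time_scaled):
    """Time for the base dataset on one node divided by the time for an
    m-times larger dataset on m nodes (assumed, commonly used definition)."""
    return time_base / time_scaled
```

For instance, with hypothetical timings of 800 s on one node and 200 s on several nodes, speedup(800.0, 200.0) gives 4.0; if the base job takes 100 s and the scaled job takes 125 s, scaleup(100.0, 125.0) gives 0.8.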
In our experiment, the number of nodes varies from one to eight, and the size of the dataset increases from 1 GB to 10.8 GB. Figure 7(a) shows the speedup for different numbers of nodes. Our algorithm achieves reasonable speedup: on all four datasets of different sizes, the speedup consistently goes up when more nodes are available, and the speedup improves as the size of the dataset increases. Figure 7(b) shows how well the MK-means algorithm deals with larger datasets when more computer nodes are available; the MK-means algorithm scales very well. This system is deployed in the meteorological information data center at the Nanjing University of Information Science & Technology to analyze meteorological information. In this real-world setting, it is stable and reliable, and it meets the needs of analyzing large meteorological datasets.
To validate the MK-means algorithm for meteorological data, we compared MK-means with the parallel K-means algorithm of Zhao et al. [4]. Both algorithms use the MapReduce framework for clustering and are therefore comparable.
In addition, the squared-error criterion is used to measure the result of clustering, defined as $E = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - c_i \rVert^2$, where $E$ is the sum of the squared errors for all objects in the dataset, $x$ is a given object in cluster $C_i$, and $c_i$ is the mean of cluster $C_i$. The comparative evaluation of the square error between MK-means and the parallel K-means algorithm [4] is shown in Figure 8.
From Figure 8, it is easy to see that the square error of MK-means is significantly lower than that of the parallel K-means algorithm [4]. This shows that MK-means improves the stability of the K-means algorithm on meteorological data and can partly solve the instability and sensitivity to outliers of K-means.
We also investigate the impact of file size. In our experiments, we use two contrasting datasets (dataset 5 and dataset 6, shown in Table 4) whose total sizes are the same (230 MB). Dataset 5 consists of 1217 small files, each between 250 KB and 500 KB; dataset 6 consists of one large file (about 230 MB). The default Hadoop data block size is 64 MB. The throughputs for the two types of dataset are shown in Table 4. From Table 4, we can see that the throughput for dataset 5 (a large number of small files) is much lower than that for dataset 6 (one large file). Therefore, we conclude that Hadoop is better suited to handling large files. This is because a lot of time is wasted on reading and writing a large number of small files during the Map operation.
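One way to see why many small files are costly is to estimate how many map tasks each dataset produces. The sketch below assumes that every file yields at least one input split and that a file larger than one block yields roughly one split per block; this is a simplification of how Hadoop computes splits, and the function name estimated_map_tasks is hypothetical.

```python
import math


def estimated_map_tasks(file_sizes_mb, block_size_mb=64):
    """Rough count of input splits: at least one per file, about one per block for large files.

    This is a simplified assumption, not the exact split logic of Hadoop.
    """
    return sum(max(1, math.ceil(size / block_size_mb)) for size in file_sizes_mb)
```

Under these assumptions, a single 230 MB file yields about 4 map tasks with 64 MB blocks, whereas 1217 files that are each smaller than one block yield about 1217 map tasks, so the per-task startup and file-handling overhead is paid several hundred times more often for the same total volume of data.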
6. Conclusion
With the development of cloud computing, research on distributed parallel algorithms is attracting more and more attention. Some parallel classification and clustering algorithms already exist; however, an effective and inexpensive solution for processing massive meteorological information is still in high demand. In this paper, we introduced a meteorological information processing system based on cloud computing and compared it with some existing approaches. We then proposed a fast K-means clustering algorithm, MK-means, for analyzing meteorological information using MapReduce. After building the Hadoop experimental platform, we investigated the performance of the MK-means algorithm. Our experimental results show that the MK-means algorithm, deployed in a large-scale meteorological data processing system, is feasible and efficient. In future work, we will further optimize the algorithm and integrate other parallel and distributed algorithms into the system to meet the challenges of big data.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This work is supported by the project of China Meteorological Administration Soft Science (no. SK20120151) and a Project Funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions.