Abstract

In atmospheric science, the scale of meteorological data is massive and growing rapidly. K-means is a fast and widely used clustering algorithm that has been applied in many fields. However, for large-scale meteorological data, the traditional K-means algorithm cannot satisfy actual application needs efficiently. This paper proposes an improved K-means algorithm, MK-means, based on MapReduce and designed for the characteristics of large meteorological datasets. The experimental results show that MK-means provides better computing performance and scalability.

1. Introduction

In the atmospheric sciences, meteorological data are extremely rich and valuable; processing them requires massive scientific computing, which in turn provides services to the community. With the further expansion of meteorological services and the modernization of meteorology, a large amount of meteorological information has been accumulated and collected through meteorological services, research, and management activities. High-performance computers are required to process these data, but small organizations and units cannot afford their high price. Cloud computing technology provides cheap computing services to meteorological organizations with higher efficiency, lower cost, and a lower carbon footprint. Climate data are dramatically increasing in volume and complexity, as the users of these data in the scientific community and the public are rapidly increasing [1]. Faced with such large-scale meteorological data, highly efficient computing power (more than a trillion operations per second) is urgently required. Therefore, establishing a cloud computing weather information processing system is important and significant.

MapReduce is a key technology for using cloud computing to process large amounts of data. Proposed by Google, it is a parallel programming model and an associated implementation for processing and generating large datasets in a broad variety of real-world tasks. It is not only a programming model but also a task scheduling model. It is composed of two fundamental user-defined functions: Map and Reduce. A Map function handles a key/value pair to produce intermediate key/value pairs. A Reduce function combines all of the intermediate values associated with the same intermediate key [2]. MapReduce is typically used to perform distributed computing on clusters of computers. Google's MapReduce abstracts distributed computing away from its complex details, so that programmers can handle large distributed system resources without any experience with parallel or distributed systems. Thereby, the effect originally achievable only on expensive high-performance computers can be achieved by low-cost computing services.

As we know, not all data mining algorithms can currently be parallelized to handle large datasets. Some algorithms cannot be parallelized in theory; others need to be adapted to take advantage of the efficiency of parallelization. In this paper, we utilize the K-means algorithm within the MapReduce framework. Specifically, we adapt K-means to an open-source software framework, Hadoop, and apply the resulting parallel K-means algorithm (MK-means) to cluster large-scale weather data.

This paper is organized as follows. Related work is reviewed in Section 2. In Section 3, we introduce the MapReduce programming model. In Section 4, we describe our parallel K-means algorithm (MK-means) for large-scale meteorological data using MapReduce. In Section 5, we conduct experiments to evaluate the MK-means algorithm by applying it to cluster large-scale meteorological data. Finally, we conclude the paper in Section 6.

2. Related Work

In recent years, there has been significant research on K-means clustering and MapReduce. The K-means clustering problem has been well studied in data mining research and related fields; K-means is one of the top 10 algorithms in data mining [3]. Its simplicity and speed allow it to run on large datasets. With the development of information technology, the volume of information keeps growing. MapReduce is a quite novel programming model for solving certain kinds of distributable problems and processing large datasets [2]. To deal with high dimensions and large datasets, researchers have proposed several methods [4–8]. Böse et al. [9] implemented several incremental data mining algorithms, including Naïve Bayes and PCA, and applied their methods to large-scale datasets. Chu et al. [10] implemented a number of algorithms based on MapReduce, such as SVM, ICA, PCA, Gaussian Discriminant Analysis, EM, and Backpropagation. Chao et al. [6] proposed a parallel Co-means algorithm based on MapReduce, which distributes the clustering load over a given number of processors. Reference [8] adapts an ensemble learning method, bagging, to overcome the instability and sensitivity to outliers of clustering on large datasets. There has also been work on developing exact and approximation algorithms that fit into the MapReduce framework [11]. Apache Hadoop [2] is a free Java MapReduce framework that allows the parallel and distributed processing of large datasets. Zhao et al. [4] presented a fast parallel K-means clustering algorithm based on the MapReduce framework; however, their approach does not consider the characteristics of large meteorological datasets and cannot achieve good results on them. Reference [12] demonstrated the utility of the K-means clustering algorithm for identifying relationships between winds at turbine heights and climate oscillations, thereby developing a method for predicting the impacts of climate change on wind resources. However, only a few studies on dealing with large-scale meteorological data using MapReduce have been reported.

In this paper, we present a parallel clustering algorithm, MK-means, which is based on both K-means and MapReduce, for very large meteorological data.

3. MapReduce Overview

As mentioned above, MapReduce was developed by Google. Its libraries have been written in many programming languages, such as Java, Python, and C++ [13–16]. It is mainly used to process large-scale (TB-level) data files. MapReduce is not only a simplified programming model but also an efficient distributed scheduling model. Programming is very simple in such a cloud computing environment: cluster concerns, including reliability and scalability, are handled by the platform [17], and application developers only need to focus on the application itself. "Map" and "Reduce" are the two basic computing units of the MapReduce model. Massive data is cut into independent blocks by the Map program and scheduled onto many computers for processing, achieving distributed computing. The results from these computers are then summarized and output by the Reduce program.

In MapReduce, massive data is processed in parallel. Data is initially partitioned across the nodes (computers) of a cluster and stored in a distributed file system (DFS). Data is represented as (key, value) pairs. The computation of the two functions is expressed formally as follows [5]: $\mathrm{map}\colon (k_1, v_1) \to \mathrm{list}(k_2, v_2)$, $\mathrm{reduce}\colon (k_2, \mathrm{list}(v_2)) \to \mathrm{list}(v_2)$.

Google's MapReduce programming model is shown in Figure 1.

To further understand the MapReduce programming model, the pseudocode of a program based on MapReduce is shown in Algorithm 1. The program calculates the annual maximum temperature [18].

map(String input_key, String input_value):
 // input_key: document name
 // input_value: document contents
 for each (year, temperature) in input_value:
  EmitIntermediate(year, temperature);
reduce(String output_key, Iterator intermediate_values):
 // output_key: year
 // intermediate_values: a list of temperatures
  int maxValue = Integer.MIN_VALUE;
  for each v in intermediate_values:
   maxValue = Math.max(maxValue, v);
 Emit(output_key, maxValue);

The Map function extracts all the year/temperature pairs appearing in the text, and these key/value pairs are sent to an intermediate temporary space specified by MapReduce. After intermediate processing, the key/value pairs are grouped by key, so that each year is followed by a list of temperatures. The Reduce function then only needs to find the maximum of each list. The result is the annual maximum temperature.

Figure 2 shows the intermediate results of each step of the MapReduce execution process, including the Map and Reduce phases, both of which use all nodes in the cluster. Between the Map and Reduce phases, there is an intermediate shuffle phase, which groups the intermediate results with the same key into a list. The list is then used by the Reduce function to output the maximum temperature of a given year.
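
To make the three phases concrete, suppose the Map tasks emit the following pairs (the years and temperatures are illustrative values only, not data from our experiments):

Map output:    (2004, 28), (2005, 25), (2004, 31), (2005, 22)
Shuffle/group: (2004, [28, 31]), (2005, [25, 22])
Reduce output: (2004, 31), (2005, 25)

Each year's list is reduced to its maximum, yielding the annual maximum temperature.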

4. MK-Means Clustering Algorithm

K-means is a partition-based clustering algorithm widely used in various cluster analyses. It clusters well on data with spherical, convex distributions, but, for massive datasets, it encounters an efficiency bottleneck in calculating the distances between objects. It is only guaranteed to converge to a local optimum, and its clustering results are very sensitive to the choice of initial centroids. Most importantly, it is not efficient for processing massive data. In this section, we present how to adapt K-means to the parallel environment for big data.

Let us briefly review the K-means algorithm. Here is its formal description.

Given a set of $n$ data points $\{x_1, \ldots, x_n\}$ in the $d$-dimensional space and a number $k$ of clusters, the goal of K-means is to find the centroids $c_1, \ldots, c_k$ to minimize $\phi = \sum_{i=1}^{n} d(x_i, C)^2$, where $d(x_i, C) = \min_{1 \le j \le k} \|x_i - c_j\|$. In order to find the optimal centroids, the K-means algorithm initially selects $k$ central points at random in the $d$-dimensional space. Then the algorithm calculates the distance of each data point to the centroids and assigns the data point to the closest centroid. After all data points are assigned to their closest centroids, the initial clusters are formed. For each cluster, K-means readjusts its centroid by computing the mean of each dimension over the data points in the cluster. Thus, the centroids are updated. With the updated centroids, K-means reassigns all the data points to the centroids again. This process repeats until the assignments of the data points no longer change.
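
For reference, here is a minimal sequential sketch of this procedure in Java; the dense double[] data layout, the fixed random seed, and all identifiers are illustrative assumptions rather than the implementation evaluated in this paper.

import java.util.Arrays;
import java.util.Random;

public class SequentialKMeans {
    // Squared Euclidean distance between two d-dimensional points.
    static double dist2(double[] a, double[] b) {
        double s = 0;
        for (int t = 0; t < a.length; t++) {
            double d = a[t] - b[t];
            s += d * d;
        }
        return s;
    }

    // Runs K-means on points with k clusters; returns the final centroids.
    static double[][] cluster(double[][] points, int k, int maxIter) {
        int n = points.length, d = points[0].length;
        Random rnd = new Random(42);               // fixed seed, for reproducibility
        double[][] centers = new double[k][];
        for (int j = 0; j < k; j++)                // naive seeding: k random samples
            centers[j] = points[rnd.nextInt(n)].clone();
        int[] assign = new int[n];
        Arrays.fill(assign, -1);
        for (int iter = 0; iter < maxIter; iter++) {
            boolean changed = false;
            // Assignment step: each point goes to its closest centroid.
            for (int i = 0; i < n; i++) {
                int best = 0;
                for (int j = 1; j < k; j++)
                    if (dist2(points[i], centers[j]) < dist2(points[i], centers[best]))
                        best = j;
                if (assign[i] != best) { assign[i] = best; changed = true; }
            }
            if (!changed) break;                   // converged: no assignment changed
            // Update step: each centroid becomes the mean of its cluster.
            double[][] sum = new double[k][d];
            int[] count = new int[k];
            for (int i = 0; i < n; i++) {
                count[assign[i]]++;
                for (int t = 0; t < d; t++) sum[assign[i]][t] += points[i][t];
            }
            for (int j = 0; j < k; j++)
                if (count[j] > 0)                  // keep old center if a cluster empties
                    for (int t = 0; t < d; t++) centers[j][t] = sum[j][t] / count[j];
        }
        return centers;
    }
}

It is exactly the per-point distance computation in the assignment step that MK-means distributes across Map tasks below.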

First, $k$ objects chosen from the $n$ data objects serve as initial cluster centers; for the remaining objects, the distances between each central point and all remaining points are not recalculated while updating the central points in each cycle. Instead, the distance between a central point and the points is calculated based on the clustering result, taking the mean, and the central points for the next cycle are obtained. The clustering process of K-means is shown in Figure 3. From the figure, we can clearly see that the selected cluster centers are iteratively updated until they reach a stable status; as shown in the red circle, the clustering result is then finalized. The clusters obtained are assigned to the computer nodes, the central points of K-means are calculated by these nodes, and the central points are returned. The distance between all data points and each central point is then calculated to obtain the clustering results. The K-means algorithm commonly uses the Euclidean distance as the measure of similarity, and the clustering effect is evaluated by an objective function, defined below.

The K-means algorithm tries to find the optimal solution by minimizing the square error $E = \sum_{j=1}^{k} \sum_{x_i \in C_j} \|x_i - c_j\|^2$, where $k$ is the number of clusters, $n = \sum_{j=1}^{k} |C_j|$ is the total number of data objects, $c_j$ is the center of the $j$th cluster, and $\|x_i - c_j\|$ is the Euclidean distance between the sample $x_i$ and the center of the $j$th cluster.

Definition 1. Let $C = \{c_1, c_2, \ldots, c_k\}$ be a set of centroid points in the $d$-dimensional space. If $C$ satisfies $\phi(C) \le (1 + \varepsilon)\,\phi_{\mathrm{opt}}$, where $\phi_{\mathrm{opt}}$ is the optimal value of the clustering objective of K-means, then $C$ is said to be an $\varepsilon$-approximate centroid point set.

Definition 2. Let $x_1, x_2, \ldots, x_n$ be points of the $d$-dimensional space. Suppose there exist real numbers $\lambda_1, \lambda_2, \ldots, \lambda_n$ satisfying $\lambda_i \ge 0$ for $i = 1, \ldots, n$ and $\sum_{i=1}^{n} \lambda_i = 1$. For a point $x$ in the $d$-dimensional space, if $x = \sum_{i=1}^{n} \lambda_i x_i$ is established, then $x$ is a convex combination of the points $x_1, x_2, \ldots, x_n$.

Theorem 3. For a fixed point set $S$ with centroid $c(S) = (1/|S|)\sum_{x \in S} x$ and for any point $z$ in the $d$-dimensional space, $\sum_{x \in S} \|x - z\|^2 = \sum_{x \in S} \|x - c(S)\|^2 + |S| \cdot \|c(S) - z\|^2$.

In order to take advantage of high-performance parallel computing in meteorological fields, we propose the fast MK-means algorithm for weather information processing using the MapReduce model. The parallel workflow of the MK-means algorithm is shown in Figure 4.

Figure 4 represents the running process of the parallel MK-means based on a MapReduce execution. The MapReduce process first splits the data into segments [16]. Then each Map task generates a sequence of (key, value) pairs from its segment, which are stored in HDFS files. For each Map task, MK-means constructs a global variable holding the centers of the clusters. Next, the library runs many copies of the program on the Hadoop cluster. The intermediate pairs are then stored in memory, shuffled, and sorted. Finally, the Reduce task sums all samples and computes the total number of samples assigned to the same cluster, so we obtain the new centers, which are used for the next iteration.
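
As a rough illustration of how one iteration of this workflow can be submitted, the following minimal driver sketch assumes the classic org.apache.hadoop.mapred API shipped with Hadoop 0.19; the class names MKMeansDriver, MKMeansMapper, MKMeansCombiner, and MKMeansReducer, and the paths, are hypothetical placeholders rather than the system's actual code.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MKMeansDriver {
    // Submits one clustering iteration and blocks until it finishes;
    // the updated centers are written under <outputDir>/clusters-<iteration>.
    public static void runIteration(String inputDir, String outputDir, int iteration)
            throws Exception {
        JobConf conf = new JobConf(MKMeansDriver.class);
        conf.setJobName("mk-means-iteration-" + iteration);
        conf.setMapperClass(MKMeansMapper.class);      // nearest-center assignment
        conf.setCombinerClass(MKMeansCombiner.class);  // local partial means per cluster
        conf.setReducerClass(MKMeansReducer.class);    // global means = new centers
        conf.setOutputKeyClass(IntWritable.class);     // cluster index
        conf.setOutputValueClass(Text.class);          // serialized sample or center
        FileInputFormat.setInputPaths(conf, new Path(inputDir));
        FileOutputFormat.setOutputPath(conf, new Path(outputDir + "/clusters-" + iteration));
        JobClient.runJob(conf);
    }
}

A driver of this shape would be called in a loop, reloading the centers produced by iteration $i$ before submitting iteration $i + 1$, until the centers converge.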

The MK-means clustering algorithm for meteorological data proposed in this paper is shown in Algorithm 2.

Input: data of each automatic station
 File 1: data of automatic station 1
 File 2: data of automatic station 2
 ...
 File n: data of automatic station n
Output: <key′, value′> pairs, where key′ is the intermediate key of the clustering,
 and value′ is the intermediate value associated with key′
the sample distances are calculated based on the values;
the minimum distance is repeatedly calculated based on the method of the
center of mass;

Map Function. The Map function first constructs a global variable list, centers. It then scans a sequence file of (key, value) pairs as input and reads each line as a data sample. For each sample, it calculates the distance to each centroid stored in centers and chooses the minimum distance. The data sample is assigned to the corresponding cluster center, and the index of that center is attached to the sample. The pseudocode of the Map function is shown in Algorithm 3.

// map(key, value)
Input: the global variable centers, the offset key, and the sample value
Output: <key′, value′> pair, where key′ is the index of the closest center point to value′
(1) Construct a global variable centers, and assign the center points to it.
(2) minDistance = MAX_VALUE;
(3) For (i = 0; i < centers.length; i++) {
   If (distance(point, centers[i]) < minDistance) {
    minDistance = distance(point, centers[i]);
    index = i;
   }
  }
(4) key′ = index;
(5) value′ = point;
(6) Output (key′, value′)
(7) End
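
A corresponding Java sketch of this map function, under the same classic-API assumptions as the driver above (one comma-separated sample per line; centers loaded once per task in configure(), e.g. from a side file in HDFS), might look as follows; it is illustrative, not the paper's actual code.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MKMeansMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, IntWritable, Text> {
    private double[][] centers;  // the global variable centers of Algorithm 3

    @Override
    public void configure(JobConf job) {
        // Assumption: the current centers are loaded here from HDFS or the
        // DistributedCache; omitted for brevity.
    }

    public void map(LongWritable offset, Text sample,
                    OutputCollector<IntWritable, Text> output, Reporter reporter)
            throws IOException {
        // Parse one line "v1,v2,...,vd" into a point.
        String[] parts = sample.toString().split(",");
        double[] point = new double[parts.length];
        for (int t = 0; t < parts.length; t++)
            point[t] = Double.parseDouble(parts[t]);
        // Find the index of the closest center (steps (2)-(3) of Algorithm 3).
        int index = 0;
        double minDistance = Double.MAX_VALUE;
        for (int j = 0; j < centers.length; j++) {
            double d = 0;
            for (int t = 0; t < point.length; t++) {
                double diff = point[t] - centers[j][t];
                d += diff * diff;
            }
            if (d < minDistance) { minDistance = d; index = j; }
        }
        // Emit (cluster index, sample) for the combine and reduce functions.
        output.collect(new IntWritable(index), sample);
    }
}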

Processed in this way according to the meteorological services, Worker 1, Worker 2, and Worker 3 each emit the map output pairs for the samples in their own input splits.

Combine Function. Each map task produces a large amount of intermediate data. In order to reduce the communication burden among nodes, the combine function locally aggregates the points assigned to the same cluster into their mean value together with their count, and passes this partial result to the reduce function. The pseudocode of the combine function is shown in Algorithm 4.

// combine(key, points)
Input: <key, points> pair, where key is the index of a cluster and points is the list
 of the samples assigned to that cluster
Output: <key, value′> pair, where value′ comprises the local mean of the samples
 and their number num
(1) num = 0;
(2) While (points.hasNext()) {
  currentPoint = points.next();
  num++;
  For (i = 0; i < dimensions; i++)
   sum[i] += currentPoint.point[i];
(3) }
(4) For (i = 0; i < dimensions; i++)
  mean[i] = sum[i] / num;
(5) Output <key, (mean, num)> pair;
(6) End

Reduce Function. We first obtain the locally combined results from each node and then merge the local mean values into the global mean value. Input values for Reduce are grouped automatically from the intermediate results. By setting a counter count in the combine function, the reduce function knows how many data samples each local mean represents; the count records the number of data samples involved in the mean value, so the global mean can be computed as a weighted average. The pseudocode of the Reduce function is shown in Algorithm 5.

// reduce(key, points)
Input: key is the index of a cluster, points is the list of the partial results
 (local mean, num) from the combine functions of different nodes
Output: <key′, value′> pair, where key′ is the index of the cluster and value′ is
 the new center
(1) Num = 0;
(2) While (points.hasNext()) {
  currentPoint = points.next();
  Num += currentPoint.getNum();
  For (i = 0; i < dimensions; i++)
   sum[i] += currentPoint.point[i] * currentPoint.getNum();  // weight each local mean by its count
(3) }
(4) For (i = 0; i < dimensions; i++)
  mean[i] = sum[i] / Num;
(5) Output <key, mean> pair;
(6) End
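
To see how the combine and reduce steps fit together, consider a hypothetical one-dimensional example (the numbers are illustrative only). Suppose node A holds the samples 2.0 and 4.0 of cluster 0, and node B holds the sample 6.0 of the same cluster:

Combine on node A: cluster 0 → (mean = 3.0, num = 2)
Combine on node B: cluster 0 → (mean = 6.0, num = 1)
Reduce:            new center = (3.0 × 2 + 6.0 × 1) / (2 + 1) = 4.0

The result equals the true mean of {2.0, 4.0, 6.0}, which is why weighting each local mean by its sample count in Algorithm 5 preserves the exact K-means update.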

5. Experimental Classification Results and Analysis

The MK-means algorithm is deployed in the meteorological information data center at the Nanjing University of Information Science & Technology to analyze meteorological information. The meteorological data is described in the following subsection.

5.1. Meteorological Dataset

All experiments are conducted on a set of meteorological datasets (http://www.atmosphere.csdb.cn/page/showEntity.vpage?uri=data_ziliao.haiyangziliao). The test data comprise four years (2004 to 2007) of total-factor mapping data of national reference climatological stations from the China Meteorological Data Sharing Service System (http://cdc.cma.gov.cn/). The data include the monitoring data at 2:00, 8:00, 14:00, and 20:00 of every day from 753 national reference climatological stations across the whole country. The monitoring data contain: total cloud amount, wind direction, wind speed, sea level pressure (or the site air pressure), 3-hour pressure change, past weather 1, past weather 2, 6-hour rainfall, low cloud form, low cloud cover, low cloud height, dew point, visibility, present weather, temperature, cloud form, high cloud form, and other weather elements; the default value for missing elements is 9999.

The data used in our experiments have 26 attributes: district station number (long integer); longitude, latitude, and altitude (all floating-point); site level (integer); total cloud cover, wind direction, wind speed, sea level pressure (or site pressure), 3-hour pressure change, past weather 1, past weather 2, 6-hour rainfall, low cloud form, low cloud cover, low cloud height, dew point, visibility, present weather, temperature, cloud form, high cloud form, flag 1, and flag 2 (all integers); and 24-hour temperature change and 24-hour pressure change.

The dataset is stored as a file in HDFS, Hadoop's distributed file system. We formed four datasets, shown in Table 1. Dataset 1 is the meteorological data of 2007; Dataset 2 covers 2006 to 2007; Dataset 3 covers 2005 to 2007; Dataset 4 covers 2004 to 2007. The properties of the datasets are shown in Table 1. The selected datasets share the same characteristics, and their class attribute is numerical.

5.2. Experiment Platform

The experiment is conducted on nine PCs running the CentOS 5.4 operating system (Red Hat Enterprise Linux 4.1.2). Each PC has the related software installed, namely JDK 1.6.0, Hadoop 0.19.2, and Mahout 0.3. The distributed cloud environment is based on Hadoop. Its nodes are divided into a NameNode and DataNodes (only one NameNode and multiple DataNodes). From the MapReduce point of view, nodes are divided into a JobTracker and TaskTrackers (only one JobTracker and multiple TaskTrackers). The JobTracker and NameNode can be deployed on the same machine. The machine on which the NameNode and/or JobTracker is deployed is the master; the rest are slaves.

In the experiment, the nine PCs are used to build the cloud computing environment. Each PC has an Intel Core 2.66 GHz CPU with 2 GB RAM. The nine computers are connected through a 100 Mbps LAN switch. Each PC has a label: "Aiken" serves as the NameNode and JobTracker, while sev136, sev138, sev144, sev145, sev148, sev149, sev154, and sev155 serve as DataNodes and TaskTrackers. The hosts file (/etc/hosts) of each machine is configured, and the addresses of the NameNode and JobTracker are configured in conf/hadoop-site.xml. After the Hadoop cluster is built successfully, the information of each node and of the MapReduce tasks is shown in Figure 5.

The server "Aiken" can log in to each machine without a password via SSH keys generated with ssh-keygen. The key configuration items of Hadoop in the experiment environment are shown in Table 2. On each machine, conf/masters is modified accordingly, and the addresses of the NameNode and JobTracker are configured in conf/hadoop-site.xml. The relevant parameters are set through conf/hadoop-default.xml and conf/hadoop-site.xml.
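
As an illustration of such a configuration (a sketch only; the lowercase hostname aiken and the port numbers are assumptions, not the values of Table 2), the NameNode and JobTracker addresses would typically appear in conf/hadoop-site.xml as:

<configuration>
  <property>
    <name>fs.default.name</name>            <!-- NameNode address -->
    <value>hdfs://aiken:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>         <!-- JobTracker address -->
    <value>aiken:9001</value>
  </property>
</configuration>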

During the experiment, we found that one factor, the block size, impacts performance significantly. If the block size is set too small, the number of tasks and the coordination overhead increase, which reduces performance; if it is set too large, the benefit of parallel processing cannot be fully realized. Therefore, the block size should be chosen according to the actual amount of data to be processed.
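
For instance, using the dfs.block.size property of Hadoop 0.19, a 128 MB block size (an illustrative value, not our tuned setting) would be added to conf/hadoop-site.xml as:

  <property>
    <name>dfs.block.size</name>
    <value>134217728</value>  <!-- 128 MB, in bytes -->
  </property>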

5.3. Experimental Results

To evaluate the performance of our proposed MK-means algorithm on meteorological datasets, we use the running time, speedup, and scaleup [19]. Speedup describes the performance of a parallel algorithm in terms of the reduction in running time, which is an important indicator for verifying a parallel algorithm. Speedup is defined as $\mathrm{Speedup}(m) = T_1 / T_m$, where $T_1$ is the time it takes to solve the problem on a single machine and $T_m$ is the time spent by the parallel algorithm solving the same problem on $m$ nodes. If, as $m$ increases, the MK-means algorithm can still maintain near-linear speedup, then providing more nodes shortens the time spent.

We first conduct the experiment on the four datasets described in Table 1. The intermediate results of each iteration are stored in a "clusters-$i$" folder, where $i$ is the iteration number, and the final clustering results are stored in the "points" folder. The results are shown in Table 3.

The experimental results show that the MK-means algorithm suits the actual situation: its running procedure is stable and reliable, and it satisfies the requirements of large-scale data processing in an actual parallel and distributed environment.

Meanwhile, we further investigated the performance of the MK-means algorithm with different numbers of nodes used in Hadoop, as well as with different sizes of datasets. The corresponding running time is shown in Figure 6.

Figure 6 shows that the running time of the MK-means algorithm decreases as the number of nodes used increases; the larger the dataset, the more significantly the running time is reduced by additional computer nodes. We also evaluate the performance of the MK-means algorithm in terms of speedup and scaleup, shown in Figure 7. As described before, speedup measures the performance of the MK-means algorithm. In addition, we measure the scalability (scaleup) of our algorithm: scaleup is evaluated by increasing the number of nodes while expanding the amount of data by the same factor. Scaleup is defined as $\mathrm{Scaleup}(data, m) = T_1(data) / T_m(m \cdot data)$, where $T_1(data)$ is the execution time for processing the dataset $data$ on one node and $T_m(m \cdot data)$ is the execution time for processing the $m$-times larger dataset on $m$ nodes.

If, as $m$ changes, the value of scaleup stays in the vicinity of 1 or below, the algorithm adapts very well to the dataset. The scaleup results are shown in Figure 7(b).

In our experiment, the number of nodes varies from one to eight, and the data size increases from 1 GB to 10.8 GB. Figure 7(a) shows the speedup values for different numbers of nodes. Our algorithm exhibits reasonable speedup performance: on the four datasets of different sizes, the speedup consistently goes up when more nodes are available, and the speedup improves as the size of the dataset increases. Figure 7(b) shows how well the fast MK-means algorithm deals with large datasets when more computer nodes are available; clearly, the MK-means algorithm has very good scalability. The system is deployed in the meteorological information data center at the Nanjing University of Information Science & Technology to analyze meteorological information. In this real-world setting, it is stable and reliable and meets the needs of analyzing large meteorological data.

To validate the efficiency of the MK-means algorithm for meteorological data, we compared MK-means with the parallel K-means of [4]. The two algorithms, both built on the MapReduce framework for clustering, are directly comparable.

In addition, the squared-error criterion is used to measure the clustering result, defined as $E = \sum_{i=1}^{k} \sum_{p \in C_i} \|p - m_i\|^2$, where $E$ is the square error summation over all objects in the dataset, $p$ is a given object in cluster $C_i$, and $m_i$ is the mean of cluster $C_i$. The comparative evaluation of the square error between MK-means and K-means [4] is shown in Figure 8.

From Figure 8, it is easy to notice that the square error of MK-means is significantly lower than that of K-means. This shows that MK-means improves the stability of the K-means algorithm on meteorological data, and that our MK-means can partly solve K-means' problems of instability and sensitivity to outliers.

We also investigate the impact of the file size. In our experiments, we use two contrasting datasets (dataset 5 and dataset 6, shown in Table 4) whose total sizes are the same (230 MB). Dataset 5 consists of 1217 small files, each between 250 KB and 500 KB; dataset 6 consists of one large file (about 230 MB). The Hadoop default data block size is 64 MB. The throughputs on the two datasets are shown in Table 4. From Table 4, we can see that the throughput on dataset 5 (a large number of small files) is much lower than on dataset 6 (one large file). Therefore, we conclude that Hadoop has an advantage in handling large files, because much time is otherwise wasted on reading and writing a large number of small files during the Map operation.

6. Conclusion

With the development of cloud computing, research on distributed parallel algorithms is attracting more and more attention, and some parallel classification and clustering algorithms already exist. However, an effective and cheap solution for processing massive meteorological information is in high demand. In this paper, we initiated a meteorological information processing system based on cloud computing and compared it with some existing approaches. We proposed the fast MK-means clustering algorithm for meteorological information processing using MapReduce. After building the Hadoop experimental platform, we investigated the performance of our MK-means algorithm. Our experimental results show that deploying the MK-means algorithm in the large-scale meteorological data processing system is feasible and efficient. Next, we will further optimize the algorithm and integrate other parallel and distributed algorithms into the system to meet the challenge of Big Data.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is supported by the project of China Meteorological Administration Soft Science (no. SK20120151) and a Project Funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions.