Journal of Healthcare Engineering
Volume 2017, Article ID 1425102, 12 pages
https://doi.org/10.1155/2017/1425102
Research Article

Handling Data Skew in MapReduce Cluster by Using Partition Tuning

1 College of Information Science and Technology, Beijing Normal University, Beijing, China
2 Department of Industrial Engineering, Pusan National University, Pusan, Republic of Korea
3 Cooperative Innovation Center of Internet Healthcare, Henan Province, China
4 School of Information Engineering, Zhengzhou University, Zhengzhou, China
5 Beijing Advanced Innovation Center for Future Education, Beijing Normal University, Beijing, China

Correspondence should be addressed to Jiacai Zhang; jiacai.zhang@bnu.edu.cn

Received 31 October 2016; Revised 2 January 2017; Accepted 19 February 2017; Published 29 March 2017

Academic Editor: Chase Wu

Copyright © 2017 Yufei Gao et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

The healthcare industry has generated large amounts of data, and analyzing these data has emerged as an important problem in recent years. The MapReduce programming model has been successfully used for big data analytics. However, data skew invariably occurs in big data analytics and seriously affects efficiency. To overcome the data skew problem in MapReduce, we previously proposed a data processing algorithm called Partition Tuning-based Skew Handling (PTSH). In contrast to the one-stage partitioning strategy used in the traditional MapReduce model, PTSH uses a two-stage strategy together with a partition tuning method: it disperses key-value pairs across virtual partitions and recombines the partitions when data skew occurs. The robustness and efficiency of the proposed algorithm were tested on a wide variety of simulated datasets and real healthcare datasets. The results showed that the PTSH algorithm handles data skew in MapReduce efficiently and improves the performance of MapReduce jobs in comparison with native Hadoop, Closer, and locality-aware and fairness-aware key partitioning (LEEN). We also found that the time needed for rule extraction can be reduced significantly by adopting the PTSH algorithm, since it is more suitable for association rule mining (ARM) on healthcare data.
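The two-stage idea summarized above can be illustrated with a minimal Python sketch. It is not the PTSH implementation. It assumes a CRC32 hash for the first stage and a greedy "largest virtual partition to the least-loaded reducer" rule for the second stage. The function names `virtual_partition` and `tune_partitions` and all parameters are hypothetical, chosen only for illustration.

```python
from collections import Counter
import heapq
import zlib


def virtual_partition(key, num_virtual):
    """Stage 1 (assumed): map a key to one of many virtual partitions."""
    # CRC32 is used only because it is deterministic across runs.
    return zlib.crc32(key.encode("utf-8")) % num_virtual


def tune_partitions(pairs, num_virtual, num_reducers):
    """Stage 2 (assumed greedy rule): recombine virtual partitions so that
    reducer loads are as even as this simple heuristic allows."""
    # Count how many key-value pairs fall into each virtual partition.
    load = Counter(virtual_partition(k, num_virtual) for k, _ in pairs)

    # Min-heap of (current load, reducer id); the least-loaded reducer is on top.
    heap = [(0, r) for r in range(num_reducers)]
    heapq.heapify(heap)

    assignment = {}
    # Place the largest virtual partitions first.
    for vp, size in sorted(load.items(), key=lambda kv: (-kv[1], kv[0])):
        total, reducer = heapq.heappop(heap)
        assignment[vp] = reducer
        heapq.heappush(heap, (total + size, reducer))

    # Summarize the resulting load per reducer.
    reducer_load = [0] * num_reducers
    for vp, reducer in assignment.items():
        reducer_load[reducer] += load[vp]
    return assignment, reducer_load


if __name__ == "__main__":
    # A skewed toy input: one "hot" key plus many rare keys.
    data = [("hot", i) for i in range(50)] + [(f"k{i}", i) for i in range(50)]
    mapping, loads = tune_partitions(data, num_virtual=16, num_reducers=4)
    print(loads)
```

Because every pair with the same key lands in the same virtual partition, and each virtual partition is assigned to exactly one reducer, the sketch keeps the usual MapReduce guarantee that all values for a key reach a single reducer. A single very hot key still bounds the best achievable balance in this simplified version.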