Scientific Programming
Volume 2016 (2016), Article ID 9315493, 11 pages
http://dx.doi.org/10.1155/2016/9315493
Research Article

Optimizing Checkpoint Restart with Data Deduplication

College of Computer Science and Electronic Engineering, Hunan University, Changsha 410082, China

Received 1 March 2016; Accepted 5 May 2016

Academic Editor: Laurence T. Yang

Copyright © 2016 Zhengyu Chen et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

As computer systems grow in size and complexity, hardware and software faults occur more frequently, making fault-tolerance techniques an essential component of high-performance computing systems. Checkpoint restart is a typical and widely used method for tolerating runtime faults. However, the rapidly growing size of the checkpoint files that must be saved to external storage poses a major scalability challenge and calls for efficient approaches to reducing the amount of checkpoint data. In this paper, we first motivate the need for redundancy elimination with a detailed analysis of checkpoint data from real scenarios. Based on this analysis, we apply inline data deduplication to reduce checkpoint size. We validate our method using DMTCP, an open-source checkpoint restart package. Our experiments show that the method reduces checkpoint file size by 20% for single-computer programs and by 47% for distributed programs.