Scientific Programming
Volume 19, Issue 1, Pages 27-43
http://dx.doi.org/10.3233/SPR-2011-0317

PetaShare: A Reliable, Efficient and Transparent Distributed Storage Management System

Tevfik Kosar,1,2,3 Ismail Akturk,4 Mehmet Balman,5 and Xinqi Wang1,2

1Department of Computer Science and Engineering, State University of New York, Buffalo, NY, USA
2Department of Computer Science, Louisiana State University, Baton Rouge, LA, USA
3Center for Computation and Technology, Louisiana State University, Baton Rouge, LA, USA
4Department of Computer Engineering, Bilkent University, Ankara, Turkey
5Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA

Copyright © 2011 Hindawi Publishing Corporation. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Modern collaborative science has placed an increasing burden on data management infrastructure, which must handle the ever larger data archives being generated. Besides functionality, reliability and availability are key factors in delivering a data management system that can efficiently and effectively meet the challenges posed and compounded by the unbounded growth in the size of data generated by scientific applications. We have developed a reliable and efficient distributed data storage system, PetaShare, which spans multiple institutions across the state of Louisiana. At the back-end, PetaShare provides a unified name space and efficient data movement across geographically distributed storage sites. At the front-end, it provides light-weight clients that enable easy, transparent and scalable access. In PetaShare, we have designed and implemented an asynchronously replicated multi-master metadata system for enhanced reliability and availability, and an advanced buffering system for improved data transfer performance. In this paper, we present the details of our design and implementation, show performance results, and describe our experience in developing a reliable and efficient distributed data management system for data-intensive science.