Scientific Programming
Volume 15, Issue 4, Pages 249-268
http://dx.doi.org/10.1155/2007/701609

Optimizing Workflow Data Footprint

Gurmeet Singh,1 Karan Vahi,1 Arun Ramakrishnan,1 Gaurang Mehta,1 Ewa Deelman,1 Henan Zhao,2 Rizos Sakellariou,2 Kent Blackburn,3 Duncan Brown,3,4 Stephen Fairhurst,3,5 David Meyers,3 G. Bruce Berriman,6 John Good,6 and Daniel S. Katz7

1USC Information Sciences Institute, 4676 Admiralty Way, Marina Del Rey, CA 90292, USA
2School of Computer Science, University of Manchester, Manchester M13 9PL, UK
3LIGO Laboratory, California Institute of Technology, MS 18-34, Pasadena, CA 91125, USA
4Theoretical Astrophysics, California Institute of Technology, MS 130-33, Pasadena, CA 91125, USA
5Physics Department, University of Wisconsin-Milwaukee, Milwaukee, WI 53202, USA
6Infrared Processing and Analysis Center, California Institute of Technology, CA 91125, USA
7Center for Computation and Technology, Louisiana State University, Baton Rouge, LA 70803, USA

Received 30 November 2007; Accepted 30 November 2007

Copyright © 2007 Hindawi Publishing Corporation. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

In this paper we examine the problem of optimizing disk usage when scheduling large-scale, data-intensive scientific workflows onto distributed resources with limited storage. Our approach is two-fold: we minimize the amount of space a workflow requires during execution by removing data files at runtime as soon as they are no longer needed, and we demonstrate that workflows may have to be restructured to reduce their overall data footprint. We show the results of our data management and workflow restructuring solutions using a Laser Interferometer Gravitational-Wave Observatory (LIGO) application and an astronomy application, Montage, running on a large-scale production grid, the Open Science Grid. We show that dynamic data cleanup techniques alone can reduce the data footprint of Montage by 48%, whereas LIGO Scientific Collaboration workflows require additional restructuring to achieve a 56% reduction in data space usage. We also examine the cost of the workflow restructuring in terms of the application's runtime.
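The core idea behind dynamic data cleanup can be illustrated with a minimal sketch: track, for each intermediate file, how many downstream tasks still need it, and delete the file as soon as that count reaches zero. The sketch below is illustrative only; the `Task` class, the `peak_footprint` function, and the diamond-shaped example workflow are assumptions for this example, not the paper's actual implementation.

```python
# Illustrative sketch of dynamic data cleanup in a workflow:
# a file is deleted as soon as every task that consumes it has run.
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    inputs: list    # files this task reads
    outputs: list   # files this task writes


def peak_footprint(tasks, sizes, cleanup=True):
    """Simulate running `tasks` in the given (topological) order and
    return the peak number of bytes resident on disk, optionally
    deleting each file once its last consumer has executed."""
    # Count how many tasks still need each file.
    consumers = defaultdict(int)
    for t in tasks:
        for f in t.inputs:
            consumers[f] += 1

    on_disk, peak = set(), 0
    for t in tasks:
        on_disk.update(t.outputs)  # task materializes its outputs
        peak = max(peak, sum(sizes[f] for f in on_disk))
        if cleanup:
            for f in t.inputs:
                consumers[f] -= 1
                if consumers[f] == 0:   # no remaining consumers
                    on_disk.discard(f)  # safe to delete the file
    return peak


# Diamond workflow: A feeds B and C, which both feed D.
tasks = [
    Task("A", [], ["a.dat"]),
    Task("B", ["a.dat"], ["b.dat"]),
    Task("C", ["a.dat"], ["c.dat"]),
    Task("D", ["b.dat", "c.dat"], ["d.dat"]),
]
sizes = {"a.dat": 10, "b.dat": 10, "c.dat": 10, "d.dat": 10}
print(peak_footprint(tasks, sizes, cleanup=False))  # 40
print(peak_footprint(tasks, sizes, cleanup=True))   # 30
```

In this toy example, cleanup deletes `a.dat` once both B and C have run, lowering the peak footprint from 40 to 30 bytes. The same principle underlies the paper's runtime cleanup: in real deployments the cleanup actions are scheduled as extra jobs in the workflow rather than performed inline.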