Scientific Programming
Volume 19 (2011), Issue 2-3, Pages 147-159

Large Science Databases – Are Cloud Services Ready for Them?

Ani Thakar,1 Alex Szalay,1 Ken Church,2 and Andreas Terzis3

1Department of Physics and Astronomy and the Institute for Data Intensive Engineering and Science, The Johns Hopkins University, Baltimore, MD, USA
2Human Language Technology Center of Excellence and IDIES, The Johns Hopkins University, Baltimore, MD, USA
3Department of Computer Science and IDIES, The Johns Hopkins University, Baltimore, MD, USA

Copyright © 2011 Hindawi Publishing Corporation. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


We report on attempts to put an astronomical database – the Sloan Digital Sky Survey science archive – in the cloud. We find that, at this time, migrating a complex SQL Server database into current cloud service offerings such as Amazon (EC2) and Microsoft (SQL Azure) ranges from very frustrating to impossible. Migrating a large database in excess of a terabyte is certainly impossible, but even with (much) smaller databases, the limitations of cloud services make it very difficult to move the data to the cloud without making changes to the schema and settings that would degrade performance and/or make the data unusable. Preliminary performance comparisons show a large performance discrepancy with the Amazon cloud version of the SDSS database. These difficulties suggest that much work and coordination need to occur between cloud service providers and their potential clients before science databases – not just large ones but even smaller databases that make extensive use of advanced database features for performance and usability – can be deployed successfully and effectively in the cloud. We describe a powerful new computational instrument that we are developing in the interim – the Data-Scope – that will enable fast and efficient analysis of the largest (petabyte-scale) scientific datasets.