Scientific Programming

Special Issue

Selected Papers from Super Computing 2012


Open Access

Volume 21 | Article ID 473915 | https://doi.org/10.3233/SPR-130374

Jinsuk Chung, Ikhwan Lee, Michael Sullivan, Jee Ho Ryoo, Dong Wan Kim, Doe Hyun Yoon, Larry Kaplan, Mattan Erez, "Containment Domains: A Scalable, Efficient and Flexible Resilience Scheme for Exascale Systems", Scientific Programming, vol. 21, Article ID 473915, 16 pages, 2013. https://doi.org/10.3233/SPR-130374

Containment Domains: A Scalable, Efficient and Flexible Resilience Scheme for Exascale Systems

Abstract

This paper describes and evaluates a scalable and efficient resilience scheme based on the concept of containment domains. Containment domains are a programming construct that enables applications to express resilience needs and to interact with the system to tune and specialize error detection, state preservation and restoration, and recovery schemes. Containment domains have weak transactional semantics and are nested to take advantage of the machine and application hierarchies and to enable hierarchical state preservation, restoration and recovery. We evaluate the scalability and efficiency of containment domains using generalized trace-driven simulation and analytical analysis and show that containment domains are superior to both checkpoint-restart and redundant execution approaches.
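As a rough illustration of the preserve, restore and re-execute pattern the abstract describes, the following Python sketch models a containment domain as a function wrapper that can be nested. It is not the authors' API: `run_domain`, `DetectedError` and `max_retries` are hypothetical names, and deep-copying an in-memory dictionary is only a stand-in for whatever preservation mechanism a real system would use.

```python
import copy


class DetectedError(Exception):
    """Hypothetical signal that an error was detected inside a domain."""


def run_domain(state, body, max_retries=2):
    """Run body(state) as one containment domain (illustrative sketch only).

    State is preserved on entry. If body raises DetectedError, the preserved
    copy is restored and body is re-executed. After max_retries failed
    re-executions the error is re-raised so that the enclosing (parent)
    domain can recover instead.
    """
    preserved = copy.deepcopy(state)  # preservation on domain entry
    for attempt in range(max_retries + 1):
        try:
            body(state)  # body may call run_domain again (nesting)
            return attempt  # number of re-executions that were needed
        except DetectedError:
            # Restoration: roll the state back to the preserved copy.
            state.clear()
            state.update(copy.deepcopy(preserved))
    raise DetectedError("escalating to parent domain")


# Usage: one injected transient fault is absorbed by the inner domain.
state = {"x": 0}
pending_faults = [1]  # hypothetical fault injection


def inner(s):
    s["x"] += 1
    if pending_faults:
        pending_faults.pop()
        raise DetectedError("transient fault")


def outer(s):
    s["x"] += 10
    run_domain(s, inner)  # child domain nested inside the parent


retries = run_domain(state, outer)
```

In this sketch the fault is recovered locally by re-executing only `inner`, so the outer domain never rolls back; only an error that persists past `max_retries` propagates to the parent.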

Copyright © 2013 Hindawi Publishing Corporation. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

