Optimizing Energy Consumption on HPC Systems with a Multi-Level Checkpointing Mechanism

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

4 Citations (Scopus)

Abstract

Coordinated checkpointing is a widely used checkpoint/restart (CPR) technique for fault tolerance in large-scale HPC systems. However, it concentrates massive amounts of I/O, resulting in high checkpoint overhead and high energy consumption. This paper focuses on multi-level checkpointing, which uses different kinds of fast but less reliable storage to reduce how often checkpoints are written to the parallel file system (PFS). The paper presents an energy model of multi-level checkpointing and proposes an iterative algorithm that minimizes energy consumption by optimizing the checkpoint interval of each level and selecting the best combination of checkpoint levels. The algorithm is shown to be fast and effective, converging in a relatively small number of iterations. The paper also shows that it is unnecessary to use every available checkpoint level in a multi-level CPR mechanism: selectively using only the appropriate checkpoint levels improves energy efficiency by 9 to 21%.
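The abstract does not give the paper's energy model or its iterative algorithm, so the following is only a rough sketch of the general idea, under stated assumptions. It uses a simplified first-order model in which each selected level is treated independently, the per-level interval comes from a Young-style closed form weighted by power, and the level combination is chosen by brute-force enumeration rather than iteration. The level names, costs, power figures, and failure rates are all hypothetical placeholders.

```python
from itertools import combinations
from math import sqrt

# Hypothetical levels, ordered from cheapest/least reliable to the PFS:
# (name, checkpoint_cost_s, restart_cost_s, io_power_w, failure_rate_per_s)
# failure_rate is the rate of failures whose cheapest sufficient level is this one.
LEVELS = [
    ("local",   5.0,   5.0,   90.0,  1 / 3_600),
    ("partner", 20.0,  20.0,  100.0, 1 / 36_000),
    ("pfs",     300.0, 300.0, 120.0, 1 / 360_000),
]
COMPUTE_POWER_W = 200.0  # assumed power draw while (re)computing


def energy_overhead(selected):
    """Return (average overhead power in W, {level name: interval in s}).

    `selected` is an increasing tuple of level indices. Failures of a skipped
    level are absorbed by the next selected level above it.
    """
    total = 0.0
    intervals = {}
    prev = -1
    for idx in selected:
        name, c, r, p_io, _ = LEVELS[idx]
        # Effective failure rate this level must recover from.
        rate = sum(LEVELS[k][4] for k in range(prev + 1, idx + 1))
        # Minimizer of p_io*c/tau + rate*COMPUTE_POWER_W*tau/2 (Young-style).
        tau = sqrt(2 * p_io * c / (COMPUTE_POWER_W * rate))
        # Checkpoint energy + expected recomputation and restart energy.
        total += p_io * c / tau + rate * (COMPUTE_POWER_W * tau / 2 + p_io * r)
        intervals[name] = tau
        prev = idx
    return total, intervals


def best_combination():
    """Enumerate every subset that keeps the top (PFS) level; return the cheapest."""
    last = len(LEVELS) - 1
    best = None
    for n in range(len(LEVELS)):
        for lower in combinations(range(last), n):
            selected = lower + (last,)
            cost, intervals = energy_overhead(selected)
            if best is None or cost < best[0]:
                best = (cost, selected, intervals)
    return best


if __name__ == "__main__":
    cost, selected, intervals = best_combination()
    print([LEVELS[i][0] for i in selected], intervals, cost)
```

Enumeration is workable here only because the number of levels is small. The sketch also ignores interactions between levels, such as nested intervals, which a fuller model would need to capture.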

Original language: English
Title of host publication: 2017 IEEE International Conference on Networking, Architecture, and Storage, NAS 2017 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781538634868
Publication status: Published - 2017 Sep 6
Event: 2017 IEEE International Conference on Networking, Architecture, and Storage, NAS 2017 - Shenzhen, China
Duration: 2017 Aug 7 to 2017 Aug 9

Publication series

Name: 2017 IEEE International Conference on Networking, Architecture, and Storage, NAS 2017 - Proceedings

Other

Other: 2017 IEEE International Conference on Networking, Architecture, and Storage, NAS 2017
Country: China
City: Shenzhen
Period: 17/8/7 to 17/8/9

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Hardware and Architecture
