Improving the scalability of transparent checkpointing for GPU computing systems

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution


Abstract

As the number of nodes in a GPU computing system increases, checkpointing to a global file system becomes more time-consuming because of I/O bottlenecks and network congestion. To solve this problem, we propose a transparent and scalable checkpoint/restart mechanism for OpenCL applications, named Two-level CheCL. As its name implies, Two-level CheCL consists of two checkpoint implementations, Local CheCL and Global CheCL. Local CheCL avoids checkpointing to the global file system by using each node's local storage. Our experimental results show that Local CheCL can make the checkpointing process up to four times faster than a conventional checkpointing mechanism. We also implement Global CheCL, which uses a global file system, to ensure that a global checkpoint file is always available even after a catastrophic failure. We analyze the performance of the proposed mechanism with a two-level checkpoint model.
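The two-level idea in the abstract can be illustrated with a minimal sketch. This is not the CheCL implementation and does not touch OpenCL state; it only assumes that a checkpoint is a serializable Python object, that a local directory stands in for node-local storage, and that a second directory stands in for the global file system. The function names and the `global_interval` parameter are hypothetical, chosen for illustration.

```python
import os
import pickle


def write_checkpoint(state, directory, step):
    """Write `state` to directory/ckpt_<step>.pkl and return the final path."""
    os.makedirs(directory, exist_ok=True)
    final_path = os.path.join(directory, f"ckpt_{step:06d}.pkl")
    tmp_path = final_path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(state, f)
    # Rename only after the write completes, so readers never see a partial file.
    os.replace(tmp_path, final_path)
    return final_path


def two_level_checkpoint(state, step, local_dir, global_dir, global_interval):
    """Checkpoint to local storage on every call and to global storage
    only every `global_interval` steps (hypothetical policy)."""
    written = [write_checkpoint(state, local_dir, step)]
    if step % global_interval == 0:
        written.append(write_checkpoint(state, global_dir, step))
    return written


def latest_checkpoint(directory):
    """Return the path of the newest checkpoint in `directory`, or None."""
    if not os.path.isdir(directory):
        return None
    # Zero-padded step numbers make lexicographic order match step order.
    names = sorted(n for n in os.listdir(directory) if n.endswith(".pkl"))
    return os.path.join(directory, names[-1]) if names else None


def restore(local_dir, global_dir):
    """Prefer the newest local checkpoint; fall back to the global one
    when local storage is lost (for example, after a node failure)."""
    path = latest_checkpoint(local_dir) or latest_checkpoint(global_dir)
    if path is None:
        return None
    with open(path, "rb") as f:
        return pickle.load(f)
```

In this sketch the frequent checkpoints stay off the shared file system, while the less frequent global copy bounds how much work is lost if a node's local storage becomes unavailable.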

Original language: English
Title of host publication: IEEE TENCON 2012
Subtitle of host publication: Sustainable Development Through Humanitarian Technology
Publication status: Published - 2012 Dec 1
Event: 2012 IEEE Region 10 Conference: Sustainable Development Through Humanitarian Technology, TENCON 2012 - Cebu, Philippines
Duration: 2012 Nov 19 to 2012 Nov 22

Publication series

Name: IEEE Region 10 Annual International Conference, Proceedings/TENCON

Other

Other: 2012 IEEE Region 10 Conference: Sustainable Development Through Humanitarian Technology, TENCON 2012
Country: Philippines
City: Cebu
Period: 12/11/19 to 12/11/22

ASJC Scopus subject areas

  • Computer Science Applications
  • Electrical and Electronic Engineering
