CheCUDA: A checkpoint/restart tool for CUDA applications

Hiroyuki Takizawa, Katsuto Sato, Kazuhiko Komatsu, Hiroaki Kobayashi

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

58 Citations (Scopus)

Abstract

This paper presents CheCUDA, a tool for checkpointing CUDA applications that use GPUs as accelerators. Because existing checkpoint/restart implementations do not support checkpointing GPU state, CheCUDA hooks a subset of the basic CUDA driver API calls to record state changes in main memory. At checkpoint time, CheCUDA copies all necessary data from video memory to main memory, disables the CUDA runtime, and then stores the recorded state changes in a file. At restart, CheCUDA reads the file, re-initializes the CUDA runtime, and restores the resources on the GPUs so that execution resumes from the stored state. The paper demonstrates that a prototype implementation of CheCUDA can correctly checkpoint and restart a CUDA application written with the basic APIs. This result also indicates that CheCUDA can migrate a process from one PC to another even if the process uses a GPU. Accordingly, CheCUDA is useful not only for enhancing the dependability of CUDA applications but also for enabling the dynamic task scheduling of CUDA applications that is required especially on heterogeneous GPU cluster systems. The paper also reports the timing overhead of checkpointing.
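The record-and-replay idea in the abstract can be illustrated with a small toy model. The sketch below is not CheCUDA's code and makes no real CUDA calls: `FakeDevice`, `HookedRuntime`, and all method names are hypothetical stand-ins, and the use of opaque handles in place of raw device pointers is an assumption of this sketch rather than something the abstract states. It only mirrors the sequence described above: track allocations in host memory, copy device data to the host, drop the device context, write the state out, and rebuild the device resources on restart.

```python
import pickle


class FakeDevice:
    """Hypothetical stand-in for a GPU context; its memory is lost when it is dropped."""

    def __init__(self):
        self._mem = {}        # device pointer -> buffer contents
        self._next = 0x1000   # next fake device address

    def mem_alloc(self, nbytes):
        ptr = self._next
        self._next += max(nbytes, 1)
        self._mem[ptr] = bytearray(nbytes)
        return ptr

    def mem_free(self, ptr):
        del self._mem[ptr]

    def memcpy_htod(self, ptr, data):
        if len(data) > len(self._mem[ptr]):
            raise ValueError("copy exceeds allocation size")
        self._mem[ptr][:len(data)] = data

    def memcpy_dtoh(self, ptr):
        return bytes(self._mem[ptr])


class HookedRuntime:
    """Wraps device calls and records live allocations in host memory."""

    def __init__(self):
        self.dev = FakeDevice()
        self.live = {}          # handle -> allocation size in bytes
        self._ptr = {}          # handle -> current device pointer
        self._next_handle = 0

    def alloc(self, nbytes):
        handle = self._next_handle
        self._next_handle += 1
        self._ptr[handle] = self.dev.mem_alloc(nbytes)
        self.live[handle] = nbytes   # the "hook": remember the allocation
        return handle

    def free(self, handle):
        self.dev.mem_free(self._ptr.pop(handle))
        del self.live[handle]        # the "hook": forget the allocation

    def upload(self, handle, data):
        self.dev.memcpy_htod(self._ptr[handle], data)

    def download(self, handle):
        return self.dev.memcpy_dtoh(self._ptr[handle])

    def checkpoint(self):
        # 1. Copy every live device buffer to host memory.
        data = {h: self.dev.memcpy_dtoh(p) for h, p in self._ptr.items()}
        # 2. Drop the device context (stands in for disabling the runtime).
        self.dev = None
        self._ptr = {}
        # 3. Serialize the recorded state; a real tool would write it to a file.
        state = {"live": dict(self.live), "data": data, "next": self._next_handle}
        return pickle.dumps(state)

    @classmethod
    def restart(cls, blob):
        state = pickle.loads(blob)
        rt = cls()                   # re-initialize a fresh device context
        rt._next_handle = state["next"]
        for handle, nbytes in state["live"].items():
            # Replay each recorded allocation, then restore its contents.
            rt._ptr[handle] = rt.dev.mem_alloc(nbytes)
            rt.live[handle] = nbytes
            rt.dev.memcpy_htod(rt._ptr[handle], state["data"][handle])
        return rt
```

In this model an application would call `alloc`, `upload`, and `free` as usual, call `checkpoint()` to obtain a byte string, and later pass that byte string to `HookedRuntime.restart` (possibly in another process) to continue with the same handles. Because the application only ever holds handles, the restored buffers may live at different device addresses without the application noticing.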

Original language: English
Title of host publication: 2009 International Conference on Parallel and Distributed Computing, Applications and Technologies, PDCAT 2009
Pages: 408-413
Number of pages: 6
Publication status: Published - 2009 Dec 1
Event: 2009 International Conference on Parallel and Distributed Computing, Applications and Technologies, PDCAT 2009 - Higashi, Hiroshima, Japan
Duration: 2009 Dec 8 to 2009 Dec 11

Publication series

Name: Parallel and Distributed Computing, Applications and Technologies, PDCAT Proceedings

Other

Other: 2009 International Conference on Parallel and Distributed Computing, Applications and Technologies, PDCAT 2009
Country: Japan
City: Higashi, Hiroshima
Period: 09/12/8 to 09/12/11

Keywords

  • Checkpoint/restart
  • Compute unified device architecture
  • Graphics processing units

ASJC Scopus subject areas

  • Software
  • Theoretical Computer Science
  • Hardware and Architecture
  • Computer Science Applications
