TY - CHAP
T1 - Automatic tuning of CUDA execution parameters for stencil processing
AU - Sato, Katsuto
AU - Takizawa, Hiroyuki
AU - Komatsu, Kazuhiko
AU - Kobayashi, Hiroaki
PY - 2010/12/1
Y1 - 2010/12/1
N2 - Recently, Compute Unified Device Architecture (CUDA) has enabled Graphics Processing Units (GPUs) to accelerate various applications. However, to fully exploit a GPU's computing power, a programmer has to carefully adjust several CUDA execution parameters, even for simple stencil processing kernels. Hence, this paper develops an automatic parameter tuning mechanism based on profiling to predict the optimal execution parameters. The paper first discusses the scope of the parameter exploration space, which is determined by the GPU's architectural restrictions. To find the optimal execution parameters, performance models are created by profiling the execution times of a kernel under each promising parameter configuration. The execution parameters are then determined from those performance models. The paper evaluates the performance improvement due to the proposed mechanism using two benchmark programs. The evaluation results show that the proposed mechanism can appropriately select a near-optimal Cooperative Thread Array (CTA) configuration whose performance is comparable to that of the optimal one.
AB - Recently, Compute Unified Device Architecture (CUDA) has enabled Graphics Processing Units (GPUs) to accelerate various applications. However, to fully exploit a GPU's computing power, a programmer has to carefully adjust several CUDA execution parameters, even for simple stencil processing kernels. Hence, this paper develops an automatic parameter tuning mechanism based on profiling to predict the optimal execution parameters. The paper first discusses the scope of the parameter exploration space, which is determined by the GPU's architectural restrictions. To find the optimal execution parameters, performance models are created by profiling the execution times of a kernel under each promising parameter configuration. The execution parameters are then determined from those performance models. The paper evaluates the performance improvement due to the proposed mechanism using two benchmark programs. The evaluation results show that the proposed mechanism can appropriately select a near-optimal Cooperative Thread Array (CTA) configuration whose performance is comparable to that of the optimal one.
UR - http://www.scopus.com/inward/record.url?scp=84887442287&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84887442287&partnerID=8YFLogxK
U2 - 10.1007/978-1-4419-6935-4_13
DO - 10.1007/978-1-4419-6935-4_13
M3 - Chapter
AN - SCOPUS:84887442287
SN - 9781441969347
SP - 209
EP - 228
BT - Software Automatic Tuning
PB - Springer New York
ER -