TY - GEN
T1 - clMPI
T2 - 2013 IEEE 27th International Parallel and Distributed Processing Symposium Workshops and PhD Forum, IPDPSW 2013
AU - Takizawa, Hiroyuki
AU - Sugawara, Makoto
AU - Hirasawa, Shoichi
AU - Gelado, Isaac
AU - Kobayashi, Hiroaki
AU - Hwu, Wen-Mei W.
PY - 2013
Y1 - 2013
AB - This paper proposes an OpenCL extension, clMPI, that allows a programmer to think as if GPUs communicate without any help from CPUs. The clMPI extension offers OpenCL commands for inter-node data transfers that are executed in the same manner as other OpenCL commands. Thus, clMPI naturally extends the conventional OpenCL programming model to improve MPI interoperability. Unlike conventional joint programming of MPI and OpenCL, CPUs do not need to be blocked to serialize dependent MPI and OpenCL operations. Hence, an application can easily exploit opportunities to overlap the parallel activities of CPUs and GPUs. In addition, the implementation details of data transfers are hidden behind the extension, so application programmers can use optimized data transfers without any tricky programming techniques. As a result, the extension can improve not only performance but also performance portability across different system configurations. The evaluation results show that the clMPI extension can use the optimized data transfer implementation and thereby increase the sustained performance of the Himeno benchmark by about 14% when the communication time cannot be overlapped with the computation time.
KW - MPI interoperability
KW - OpenCL extension
KW - clMPI
UR - http://www.scopus.com/inward/record.url?scp=84899722879&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84899722879&partnerID=8YFLogxK
U2 - 10.1109/IPDPSW.2013.183
DO - 10.1109/IPDPSW.2013.183
M3 - Conference contribution
AN - SCOPUS:84899722879
SN - 9780769549798
T3 - Proceedings - IEEE 27th International Parallel and Distributed Processing Symposium Workshops and PhD Forum, IPDPSW 2013
SP - 1138
EP - 1148
BT - Proceedings - IEEE 27th International Parallel and Distributed Processing Symposium Workshops and PhD Forum, IPDPSW 2013
PB - IEEE Computer Society
Y2 - 20 May 2013 through 24 May 2013
ER -
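
For context, the sketch below illustrates the conventional joint MPI/OpenCL pattern that the abstract contrasts clMPI against: the host CPU blocks to serialize a device-to-host copy, an MPI exchange, and a host-to-device copy. The function and buffer names are illustrative assumptions, not taken from the paper, and clMPI's actual API is not shown; per the abstract, clMPI replaces this CPU-blocking sequence with inter-node transfer commands that are enqueued and ordered like any other OpenCL command.

/* Conventional MPI + OpenCL exchange (the pattern clMPI aims to avoid).
 * Names such as halo_exchange_conventional, d_send, and peer are
 * illustrative assumptions; only standard OpenCL and MPI calls are used. */
#include <mpi.h>
#include <CL/cl.h>

void halo_exchange_conventional(cl_command_queue queue,
                                cl_mem d_send, cl_mem d_recv,
                                void *h_send, void *h_recv,
                                size_t bytes, int peer)
{
    /* 1. Blocking device-to-host copy: the CPU waits until GPU data is ready. */
    clEnqueueReadBuffer(queue, d_send, CL_TRUE, 0, bytes, h_send,
                        0, NULL, NULL);

    /* 2. Blocking MPI exchange with the neighboring node. */
    MPI_Sendrecv(h_send, (int)bytes, MPI_BYTE, peer, 0,
                 h_recv, (int)bytes, MPI_BYTE, peer, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* 3. Blocking host-to-device copy before dependent kernels may run. */
    clEnqueueWriteBuffer(queue, d_recv, CL_TRUE, 0, bytes, h_recv,
                         0, NULL, NULL);
}

Because each step blocks the host thread to enforce the MPI/OpenCL ordering, the CPU cannot easily overlap these transfers with other CPU or GPU work, which is the limitation the abstract attributes to conventional joint programming.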