This paper proposes clMPI, an OpenCL extension that allows programmers to write code as if GPUs communicated with one another without any CPU involvement. The clMPI extension provides OpenCL commands for inter-node data transfers that are enqueued and executed in the same manner as other OpenCL commands. Thus, clMPI naturally extends the conventional OpenCL programming model to improve interoperability with MPI. Unlike conventional joint MPI/OpenCL programming, the CPU does not need to block in order to serialize dependent MPI and OpenCL operations. Hence, an application can easily exploit opportunities to overlap the parallel activities of CPUs and GPUs. In addition, because the implementation details of data transfers are hidden behind the extension, application programmers can benefit from optimized data transfers without resorting to intricate programming techniques. As a result, the extension improves not only performance but also performance portability across different system configurations. The evaluation results show that clMPI can use an optimized data transfer implementation and thereby increase the sustained performance of the Himeno benchmark by about 14% when communication time cannot be overlapped with computation time.
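To illustrate the programming model described above, the following C-style sketch shows how an inter-node transfer might be enqueued like any other OpenCL command. The entry point `clEnqueueSendBuffer` and its parameters are illustrative assumptions for this sketch, not necessarily the extension's actual API; the point is that the dependency between the kernel and the transfer is expressed with OpenCL events, so the host thread never blocks to serialize them.

```
/* Hypothetical sketch: clEnqueueSendBuffer, dest_rank, and tag are
 * assumed names for illustration only. */
cl_event compute_done, send_done;

/* Launch the kernel that produces the data to be sent. */
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gsize, NULL,
                       0, NULL, &compute_done);

/* Enqueue an inter-node send that waits on the kernel via an OpenCL
 * event; the runtime serializes the dependent operations, so the
 * host thread is free to do other work in the meantime. */
clEnqueueSendBuffer(queue, buffer, dest_rank, tag,
                    1, &compute_done, &send_done);

/* The CPU overlaps its own work with GPU compute and communication. */
do_host_work();
clWaitForEvents(1, &send_done);
```

In the conventional joint MPI/OpenCL style, the host would instead have to call `clFinish` (or wait on the kernel's event) before invoking `MPI_Send`, blocking the CPU merely to enforce the ordering that the event wait list expresses here.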