TY - GEN
T1 - An Efficient Skinny Matrix-Matrix Multiplication Method by Folding Input Matrices into Tensor Core Operations
AU - Tang, Hao
AU - Komatsu, Kazuhiko
AU - Sato, Masayuki
AU - Kobayashi, Hiroaki
N1 - Publisher Copyright: © 2020 IEEE. Copyright 2021 Elsevier B.V., All rights reserved.
PY - 2020/11
Y1 - 2020/11
N2 - A specialized unit in NVIDIA's GPUs, called Tensor Core, has kept attracting attention over the last couple of years due to its high computing capability for general matrix-matrix multiplications (GEMMs). A Tensor Core unit is capable of calculating a matrix multiply-accumulate (MMA) operation of a specific size. However, if the input matrices are skinnier than a Tensor Core operation, some of the computations in that operation are wasted. Thus, this paper presents a method to optimize skinny matrix-matrix multiplication that exploits the potential of the Tensor Core units. The proposed method feeds multiple segments of an input matrix into a Tensor Core operation to utilize more of its computations. The experimental results show that the proposed method achieves up to a 2.7× speedup compared with the cuBLAS 11.0 library.
AB - A specialized unit in NVIDIA's GPUs, called Tensor Core, has kept attracting attention over the last couple of years due to its high computing capability for general matrix-matrix multiplications (GEMMs). A Tensor Core unit is capable of calculating a matrix multiply-accumulate (MMA) operation of a specific size. However, if the input matrices are skinnier than a Tensor Core operation, some of the computations in that operation are wasted. Thus, this paper presents a method to optimize skinny matrix-matrix multiplication that exploits the potential of the Tensor Core units. The proposed method feeds multiple segments of an input matrix into a Tensor Core operation to utilize more of its computations. The experimental results show that the proposed method achieves up to a 2.7× speedup compared with the cuBLAS 11.0 library.
KW - GEMM
KW - GPU
KW - optimization
KW - tall-and-skinny
KW - Tensor Core
UR - http://www.scopus.com/inward/record.url?scp=85102207989&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85102207989&partnerID=8YFLogxK
U2 - 10.1109/CANDARW51189.2020.00041
DO - 10.1109/CANDARW51189.2020.00041
M3 - Conference contribution
AN - SCOPUS:85102207989
T3 - Proceedings - 2020 8th International Symposium on Computing and Networking Workshops, CANDARW 2020
SP - 164
EP - 167
BT - Proceedings - 2020 8th International Symposium on Computing and Networking Workshops, CANDARW 2020
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 8th International Symposium on Computing and Networking Workshops, CANDARW 2020
Y2 - 24 November 2020 through 27 November 2020
ER -