TY - GEN
T1 - Scaling performance for N-body stream computation with a ring of FPGAs
AU - Huthmann, Jens
AU - Shin, Abiko
AU - Podobas, Artur
AU - Sano, Kentaro
AU - Takizawa, Hiroyuki
N1 - Funding Information:
This research was partially supported by the Grant-in-Aid for Scientific Research (B) No.17H01706 from MEXT.
Publisher Copyright:
© 2019 Copyright held by the owner/author(s). Publication rights licensed to ACM.
PY - 2019/6/6
Y1 - 2019/6/6
N2 - Field-Programmable Gate Arrays (FPGAs) offer a fairly non-invasive method to specialize custom architectures towards a specific application domain. Recent studies have successfully demonstrated that single-node FPGAs can rival both CPUs and GPUs in performance. Unfortunately, most existing studies limit themselves to a single FPGA device, and their scalability requires further investigation. In this work, we practically demonstrate how to scale the important n-body problem across a comparatively large FPGA cluster. Our design, composed of up to 256 processing elements, achieves near-linear strong scaling, with performance levels comparable to those of custom Application-Specific Integrated Circuits (ASICs). We further develop an analytical performance model, which we use to predict the performance of our solution on upcoming Intel Agilex systems. Today, our system reaches up to 47 Giga-Pairs/second, and using our performance model we predict that we can reach up to 0.142 Tera-Pairs/second peak performance with next-generation FPGAs.
AB - Field-Programmable Gate Arrays (FPGAs) offer a fairly non-invasive method to specialize custom architectures towards a specific application domain. Recent studies have successfully demonstrated that single-node FPGAs can rival both CPUs and GPUs in performance. Unfortunately, most existing studies limit themselves to a single FPGA device, and their scalability requires further investigation. In this work, we practically demonstrate how to scale the important n-body problem across a comparatively large FPGA cluster. Our design, composed of up to 256 processing elements, achieves near-linear strong scaling, with performance levels comparable to those of custom Application-Specific Integrated Circuits (ASICs). We further develop an analytical performance model, which we use to predict the performance of our solution on upcoming Intel Agilex systems. Today, our system reaches up to 47 Giga-Pairs/second, and using our performance model we predict that we can reach up to 0.142 Tera-Pairs/second peak performance with next-generation FPGAs.
UR - http://www.scopus.com/inward/record.url?scp=85070551645&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85070551645&partnerID=8YFLogxK
U2 - 10.1145/3337801.3337813
DO - 10.1145/3337801.3337813
M3 - Conference contribution
AN - SCOPUS:85070551645
T3 - ACM International Conference Proceeding Series
BT - Proceedings of the 10th International Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies, HEART 2019
PB - Association for Computing Machinery
T2 - 10th International Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies, HEART 2019
Y2 - 6 June 2019 through 7 June 2019
ER -