Recent advances in FPGAs enable high-performance, low-power computing through deeply pipelined custom hardware built from floating-point DSP blocks. In this paper, we present a stream-computing architecture and design for FPGA-based high-performance N-body simulation. The approach differs from the parallel-computing-and-reduction approach of the GRAPE systems, which are earlier custom N-body machines. The proposed architecture is composed of a force-pipeline module (FPM) and an integral-pipeline module (IPM). The FPM has a scalable structure based on n cascade-connected pairs of computing elements (CEs) and streamed register files (SRFs), so that performance can be scaled by increasing n. We also present a performance model. The measured performance of a system prototyped on a single Arria10 FPGA agrees well with the model and scales well with n, with higher efficiency for larger problem sizes. We demonstrate that the system with n = 64 CEs operating at 180 MHz achieves 10944 MFCPS (million force calculations per second) for N = 262144 particles.
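
As a rough sanity check on the headline figure, the following minimal sketch compares the reported throughput with a simple upper bound. It assumes that each CE completes at most one force calculation per clock cycle. That assumption is not stated above, and this is not the performance model referred to in the text. The function names `peak_mfcps` and `efficiency` are hypothetical and introduced only for illustration.

```python
def peak_mfcps(n_ces: int, clock_mhz: float) -> float:
    """Upper bound on throughput in MFCPS.

    ASSUMPTION (not from the text): each CE completes at most one
    force calculation per clock cycle, so the bound is n * f.
    """
    return n_ces * clock_mhz


def efficiency(measured_mfcps: float, n_ces: int, clock_mhz: float) -> float:
    """Ratio of measured throughput to the assumed upper bound."""
    return measured_mfcps / peak_mfcps(n_ces, clock_mhz)


if __name__ == "__main__":
    # Figures reported in the text: n = 64 CEs, 180 MHz, 10944 MFCPS.
    n, f_mhz, measured = 64, 180.0, 10944.0
    print(f"assumed peak: {peak_mfcps(n, f_mhz):.0f} MFCPS")
    print(f"efficiency:   {efficiency(measured, n, f_mhz):.2%}")
```

Under this assumption the bound is 64 × 180 = 11520 MFCPS, so the reported 10944 MFCPS corresponds to 95% of the assumed peak.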