Miquel Pericàs' Blog: July 2010

Tuesday, July 13, 2010

GRAPE-DR ranks 1st on Little Green500

It has been a while since I last heard from the GRAPE-DR project. This week it was finally announced that a recently installed GRAPE-DR system holds the top spot on the Little Green500 list, boasting a remarkable 815 MFLOPS per Watt. GRAPE-DR is a generalization of the GRAPE systems that professor Jun Makino had been working on for astrophysical simulations. A paper on GRAPE-DR was presented at Supercomputing'07, and I still remember reading it back then. The new system has been designed in collaboration with professor Kei Hiraki from the University of Tokyo (東大), whom I had the pleasure to meet at ISCA'04 in Munich. Glad to see the system has finally been built. Although the current system is by no means the 2 PFLOPS machine that the authors intended to build by 2008, 23 TFLOPS for a 64-node machine is still quite a remarkable feat. Every node has one accelerator board that holds 4 GRAPE-DR chips, each doing 200 GFLOPS double precision in just 50 Watts. More information can be found here.
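As a quick sanity check, the per-chip figures above can be multiplied out. This is only a rough sketch using the numbers quoted in this post. It assumes that 200 GFLOPS and 50 Watts are per-chip peak values, and it makes no claim about how the 23 TFLOPS or the 815 MFLOPS per Watt were actually measured. All variable names are mine.

```python
# Figures quoted in the post.
NODES = 64
CHIPS_PER_NODE = 4
GFLOPS_PER_CHIP = 200           # assumed to be a per-chip peak, double precision
WATTS_PER_CHIP = 50             # assumed to be chip power only
REPORTED_TFLOPS = 23
REPORTED_MFLOPS_PER_WATT = 815  # whole-system figure from the list

# Aggregate peak of all accelerator chips, in TFLOPS.
chip_peak_tflops = NODES * CHIPS_PER_NODE * GFLOPS_PER_CHIP / 1000

# Efficiency of a single chip in isolation, in MFLOPS per Watt.
chip_mflops_per_watt = GFLOPS_PER_CHIP * 1000 / WATTS_PER_CHIP

# Share of the aggregate chip peak that the reported 23 TFLOPS represents.
fraction_of_chip_peak = REPORTED_TFLOPS / chip_peak_tflops

print(f"aggregate chip peak: {chip_peak_tflops} TFLOPS")
print(f"chip-only efficiency: {chip_mflops_per_watt} MFLOPS/W")
print(f"reported / chip peak: {fraction_of_chip_peak:.2f}")
```

Under these assumptions the chips alone add up to about 51 TFLOPS and 4000 MFLOPS per Watt. The gap to the reported 815 MFLOPS per Watt would then come from everything else in the system (host CPUs, memory, power supplies) plus the difference between peak and sustained rates.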

Friday, July 2, 2010

GPU vs CPU: Intel debunks Nvidia?

It is quite uncommon for an academic paper at a major computer architecture conference to attract so much attention. But it is apparent that Intel's paper comparing (or "debunking") GPU performance against CPU performance in terms of speed-ups has caused quite a stir. Several sites, including Linux Magazine, Beyond3D and even Nvidia's forums, have reported on this paper.

While GPUs often report extraordinary peak performance based on FLOPS rate, many applications are actually limited mainly by memory bandwidth. If you compare the available memory bandwidths of CPUs and GPUs, it makes sense that reported speed-ups are in the vicinity of 1-10x. After all, adding more compute units to increase the theoretical peak FLOPS rate (as Moore's Law permits) is apparently simpler than increasing the memory bandwidth. There are already reports claiming 128 GFLOPS for an Intel Sandy Bridge (SB) processor with 4 cores. High throughput for such a processor can only be sustained if the application offers enough data reuse in the caches (the same holds for GPUs). No reuse at all would require 2 TB/s (!) of input bandwidth to keep all these SB units busy. Expect the performance of future systems to be increasingly dominated by external bandwidths.
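The 2 TB/s figure can be reproduced with a back-of-the-envelope calculation. The sketch below assumes double-precision operands (8 bytes) and two fresh input operands per floating-point operation, which is my reading of "no reuse at all". The function names and the 20 GB/s memory bandwidth in the usage lines are illustrative placeholders, not measured Sandy Bridge numbers.

```python
PEAK_GFLOPS = 128       # reported Sandy Bridge peak for 4 cores (from the post)
BYTES_PER_OPERAND = 8   # assumption: double precision
OPERANDS_PER_FLOP = 2   # assumption: two fresh inputs per operation


def required_bandwidth_gbs(peak_gflops, reuse=1.0):
    """Input bandwidth in GB/s needed to sustain peak_gflops.

    reuse is the average number of times each loaded operand is used
    before it leaves the cache (1.0 means no reuse at all).
    """
    return peak_gflops * OPERANDS_PER_FLOP * BYTES_PER_OPERAND / reuse


def attainable_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
    """Roofline-style bound: the lower of the compute peak and the
    rate at which memory can feed the compute units."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)


# No reuse: 128 * 2 * 8 = 2048 GB/s, i.e. roughly 2 TB/s.
print(required_bandwidth_gbs(PEAK_GFLOPS))

# Hypothetical 20 GB/s memory system, one flop per 16 bytes loaded.
print(attainable_gflops(PEAK_GFLOPS, 20, 1 / 16))
```

With the placeholder 20 GB/s, a kernel without reuse would sustain only about 1.25 GFLOPS of the 128 GFLOPS peak. Each operand would have to be reused roughly 100 times in cache before the compute units become the bottleneck.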