Tuesday, July 13, 2010

GRAPE-DR ranks 1st on Little Green500

It has been a while since I last heard from the GRAPE-DR project. This week it was finally announced that a recently installed GRAPE-DR system holds the top spot on the Little Green500 list, boasting a remarkable 815 MFLOPS per Watt. GRAPE-DR is a generalization of the GRAPE systems that professor Jun Makino had been working on for astrophysical simulations. A paper on GRAPE-DR was presented at Supercomputing'07, and I still remember reading it back then. The new system has been designed in collaboration with professor Kei Hiraki from the University of Tokyo (東大), whom I had the pleasure of meeting at ISCA'04 in Munich. I am glad to see that the system has finally been built. Although the current system is by no means the 2 PFLOPS machine that the authors intended to build by 2008, 23 TFLOPS for a 64-node machine is still quite a remarkable feat. Every node has one accelerator board that holds 4 GRAPE-DR chips, each doing 200 GFLOPS in double precision at just 50 Watts. More information can be found here.
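As a rough sanity check, the numbers quoted above can be put side by side with a few lines of Python. This is only a sketch: the constant and function names are my own, and I am assuming that the per-chip figures are peak values while the 23 TFLOPS and 815 MFLOPS per Watt refer to the whole system (the announcement, as quoted here, does not say how either was measured).

```python
# Figures taken from the post; names are illustrative only.
NODES = 64
CHIPS_PER_NODE = 4            # one accelerator board with 4 chips per node
CHIP_GFLOPS = 200.0           # double precision, per chip (assumed to be peak)
CHIP_WATTS = 50.0             # per chip
REPORTED_TFLOPS = 23.0        # system figure quoted in the post


def aggregate_chip_tflops():
    """Sum of the per-chip figures over the whole machine, in TFLOPS."""
    return NODES * CHIPS_PER_NODE * CHIP_GFLOPS / 1000.0


def chip_mflops_per_watt():
    """Efficiency of a single chip in isolation, in MFLOPS per Watt."""
    return CHIP_GFLOPS * 1000.0 / CHIP_WATTS


if __name__ == "__main__":
    total = aggregate_chip_tflops()
    print(f"Sum of chip figures : {total:.1f} TFLOPS")
    print(f"Reported / sum      : {REPORTED_TFLOPS / total:.0%}")
    print(f"Chip-only efficiency: {chip_mflops_per_watt():.0f} MFLOPS/W")
```

The gap between the chip-only efficiency and the 815 MFLOPS per Watt of the full system is presumably where host CPUs, memory, and everything else in the nodes show up, but that is my reading rather than something stated in the announcement.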

Friday, July 2, 2010

GPU vs CPU: Intel debunks Nvidia?

It is quite uncommon for an academic paper at a major computer architecture conference to attract so much attention. But Intel's paper comparing (or "debunking") GPU performance against CPU performance in terms of speed-ups has clearly caused quite a stir. Several sites, including Linux Magazine, Beyond3D and even Nvidia's forums, have reported on this paper.

While GPUs often report extraordinary peak performance based on FLOPS rates, many applications are actually limited mainly by memory bandwidth. If you compare the available memory bandwidths of CPUs and GPUs, it makes sense that reported speed-ups are in the vicinity of 1-10x. After all, adding more compute units to increase the theoretical peak FLOPS rate (as Moore's Law permits) is apparently simpler than increasing the memory bandwidth. There are already reports claiming 128 GFLOPS for an Intel Sandy Bridge (SB) processor with 4 cores. High throughput for such a processor can only be sustained if the application offers enough data reuse in the caches (the same holds for GPUs). No reuse at all would require 2 TB/s (!) of input bandwidth to keep all these SB units busy. Expect future systems' performance to be increasingly dominated by external bandwidth.
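The 2 TB/s figure can be reproduced with a small back-of-the-envelope script. This is a sketch under my own assumptions, none of which come from the Intel paper: every floating-point operation reads two fresh 8-byte (double-precision) operands, and `reuse_factor` is a hypothetical knob for how many times each loaded operand is used before it leaves the cache.

```python
# Assumptions (mine, for illustration): two double-precision inputs per FLOP.
PEAK_FLOPS = 128e9          # 128 GFLOPS claimed for a 4-core Sandy Bridge
OPERANDS_PER_FLOP = 2       # assumed
BYTES_PER_OPERAND = 8       # assumed: double precision


def bytes_per_flop(reuse_factor=1.0):
    """Memory traffic per operation when each operand is used reuse_factor times."""
    return OPERANDS_PER_FLOP * BYTES_PER_OPERAND / reuse_factor


def required_bandwidth(peak_flops, reuse_factor=1.0):
    """Input bandwidth (bytes/s) needed to keep all units busy at peak_flops."""
    return peak_flops * bytes_per_flop(reuse_factor)


def attainable_flops(peak_flops, bandwidth, reuse_factor=1.0):
    """FLOPS actually sustainable: capped by compute or by memory bandwidth."""
    return min(peak_flops, bandwidth / bytes_per_flop(reuse_factor))


if __name__ == "__main__":
    print(f"No reuse  : {required_bandwidth(PEAK_FLOPS) / 1e12:.2f} TB/s")
    print(f"100x reuse: {required_bandwidth(PEAK_FLOPS, 100) / 1e9:.2f} GB/s")
    # Hypothetical 20 GB/s memory system with no reuse at all:
    print(f"At 20 GB/s: {attainable_flops(PEAK_FLOPS, 20e9) / 1e9:.2f} GFLOPS")
```

With no reuse the script gives about 2 TB/s, matching the number above. The last line shows the flip side: with a hypothetical 20 GB/s memory system and no reuse, only about 1% of the peak rate is reachable, which is why cache reuse matters so much.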

Friday, June 25, 2010

Xilinx announces Virtex-7, Artix-7 and Kintex-7

These new 28nm devices will be available in Q1 2011. Check the press release and this web page for more details. The largest Virtex-7 device will feature up to 2 million logic blocks, about 2.5 times as many as the largest Virtex-6 (LX760), and also 2x the memory bandwidth. The improvements seem mostly in line with the new process technology.

For the new 7 series devices, the focus has been placed on IP portability among the different devices of the series. This seems an important step. Now let's hope that standards also emerge at the platform and high-level synthesis levels.

First post

OK, I have never been a good blogger (usually my blog dies after just a few weeks of use), but I will try again. I guess this will basically be announcements of ongoing events or news that I consider important :-)

I created this blog, attached to my new Google Sites page, as a place for announcing and discussing events.

Hope this one doesn't die too soon :-)