Time to take some measurements to see what we can do with these various technologies. I've been waiting for AMD / ATI to catch up to Nvidia so I could benchmark multiple vendors' implementations of OpenCL. Unfortunately, AMD / ATI will not have an OpenCL driver that runs on RedHat until Q1 of 2010, and I don't want to compare results across disparate systems, so for now AMD / ATI will not be evaluated.
The benchmark that I will be running uses the original binomial option model that Nvidia shipped back in 2007. While their newer version might be slightly faster, it is significantly more complicated, so I will be using the original. I ported it to OpenCL so that I can compare OpenCL's performance to CUDA and to raw CPU performance. I will be timing how long it takes to run 1024 options through the different implementations of the model.
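For readers who have not seen a binomial option model, here is a minimal CPU-only sketch of the idea together with a timing harness for a batch of options. This is not Nvidia's SDK code or my OpenCL port: it assumes a European call priced on a standard Cox-Ross-Rubinstein tree, and the names `binomial_call` and `time_batch`, the step count, and the option parameters are all made up for illustration.

```python
import math
import time


def binomial_call(spot, strike, years, rate, vol, steps=64):
    """Price a European call on a Cox-Ross-Rubinstein binomial tree (illustrative)."""
    dt = years / steps
    up = math.exp(vol * math.sqrt(dt))        # up-move factor per step
    down = 1.0 / up                           # down-move factor per step
    growth = math.exp(rate * dt)              # risk-free growth over one step
    p_up = (growth - down) / (up - down)      # risk-neutral probability of an up move
    discount = 1.0 / growth

    # Payoffs at expiry: node j has seen j up moves and (steps - j) down moves.
    values = [
        max(spot * up ** j * down ** (steps - j) - strike, 0.0)
        for j in range(steps + 1)
    ]

    # Walk back through the tree, discounting the expected value at each node.
    for level in range(steps, 0, -1):
        values = [
            discount * (p_up * values[j + 1] + (1.0 - p_up) * values[j])
            for j in range(level)
        ]
    return values[0]


def time_batch(n_options=1024, steps=64):
    """Price n_options hypothetical options and return (prices, elapsed seconds)."""
    options = [
        (100.0, 90.0 + 20.0 * i / n_options, 1.0, 0.05, 0.2)
        for i in range(n_options)
    ]
    start = time.perf_counter()
    prices = [binomial_call(*opt, steps=steps) for opt in options]
    elapsed = time.perf_counter() - start
    return prices, elapsed
```

Calling `time_batch()` prices 1024 options one after another on a single core. Each option's tree is independent of the others, which is what makes this kind of workload a natural fit for a GPU.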
The GPU that I am using is an Nvidia C1060 Tesla card. The card is housed in a Dell 690 with a 3.2 GHz Intel Xeon processor and 2 GB of RAM. The system is running 64-bit RedHat 5.3 with Nvidia's 3.0 Beta SDK and the appropriate beta driver installed.
As you can see from the graph above, the fastest version of the model was the CUDA implementation. It was 2.2X faster than the OpenCL version in single precision and 25% faster than the OpenCL version in double precision. This is not particularly surprising, given that CUDA has been around for a few years while Nvidia's OpenCL implementation has been around for less than a year (and is still in beta!).
What is amazing is that the single precision OpenCL implementation is 110X faster than the CPU implementation. This means that you would need roughly 110 3.2 GHz Intel Xeon processor cores to keep up with one Nvidia C1060 card. If we estimate the cost of a processing core at $1,000, then this $1,300 card is doing the work of $110,000 worth of CPUs!
While single precision might be OK for the Oil & Gas industry, it's typically not OK for the Financial industry. The double precision OpenCL implementation is only 12.8X faster than the CPU implementation, so the C1060 is only doing $12,800 worth of CPU work in double precision. While this is significantly less cost effective than the single precision implementation, it's still about an order of magnitude less expensive than performing the calculations on a CPU.
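The cost comparison above is simple arithmetic, so here it is spelled out. The speedups, the $1,000 per core estimate, and the $1,300 card price are the figures from this post; the function names are just for illustration, and the sketch assumes CPU work scales perfectly across cores.

```python
CORE_COST_USD = 1000.0   # estimated cost of one CPU core (figure from this post)
CARD_COST_USD = 1300.0   # price of the C1060 (figure from this post)

# Measured OpenCL speedups over the CPU implementation (figures from this post).
OPENCL_SPEEDUP = {"single": 110.0, "double": 12.8}


def equivalent_cpu_cost(speedup, core_cost=CORE_COST_USD):
    """Dollar value of the CPU cores needed to match the card, assuming perfect scaling."""
    return speedup * core_cost


def cost_advantage(speedup, core_cost=CORE_COST_USD, card_cost=CARD_COST_USD):
    """How many times cheaper the card is than the equivalent CPU cores."""
    return equivalent_cpu_cost(speedup, core_cost) / card_cost


for precision, speedup in OPENCL_SPEEDUP.items():
    print(
        f"{precision}: ${equivalent_cpu_cost(speedup):,.0f} of CPU work, "
        f"{cost_advantage(speedup):.1f}x cheaper than CPUs"
    )
```

Under these assumptions the card comes out roughly 85 times cheaper than the equivalent CPU cores in single precision and a little under 10 times cheaper in double precision.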
Why such a big difference between single and double precision? While the C1060 can support double precision, it has fewer double precision ALUs, which isn't surprising when you consider that it is a Graphics Processing Unit after all. Why would you want the added hardware cost of double precision math capabilities for a device that drives a monitor? Does the accuracy really matter that much to the human eye?
So what conclusions can we draw from this? Right now, if you want the fastest models, then CUDA is the way to go. If you want the flexibility to move to AMD / ATI, AMD Fusion, or Intel Larrabee, then OpenCL is the way to go. Also, is the 25% penalty that you are going to pay in double precision for OpenCL over CUDA really that big of a deal? You're still achieving an order of magnitude speedup over the CPU implementation.
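To put the CUDA versus OpenCL trade-off in numbers, the sketch below combines the ratios reported in this post. It assumes that "N X faster" means the slower version takes N times as long to run, and that "25% faster" means a ratio of 1.25. The implied CUDA-versus-CPU figures are derived from those ratios, not separately measured.

```python
# Ratios reported in this post.
OPENCL_VS_CPU = {"single": 110.0, "double": 12.8}
CUDA_VS_OPENCL = {"single": 2.2, "double": 1.25}


def cuda_vs_cpu(precision):
    """Implied CUDA speedup over the CPU, chaining the two reported ratios."""
    return CUDA_VS_OPENCL[precision] * OPENCL_VS_CPU[precision]


def opencl_throughput_fraction(precision):
    """Fraction of CUDA's throughput that the OpenCL version achieves."""
    return 1.0 / CUDA_VS_OPENCL[precision]


for precision in ("single", "double"):
    print(
        f"{precision}: CUDA ~{cuda_vs_cpu(precision):.0f}x the CPU, "
        f"OpenCL at {opencl_throughput_fraction(precision):.0%} of CUDA throughput"
    )
```

Read this way, the OpenCL version runs at about 80% of CUDA's throughput in double precision but only about 45% in single precision, so the size of the penalty depends heavily on which precision you need.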