For those who haven’t heard, AMD recently released a beta OpenCL driver for x86 CPUs. After seeing AMD’s impressive OpenCL demo that scales on and on as you add CPU cores, I was dying (!) to give their OpenCL beta driver (which currently only supports CPUs) a test drive – especially on my Intel system.
I used the SDK’s SimpleConvolution sample as a testbench to get some performance measurements compared to the very straightforward reference CPU implementation also included in the SDK (compiled w/ just the -O3 gcc flag in release mode). The regular CPU implementation is also used to validate the OpenCL results.
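For reference, the straightforward CPU path looks roughly like this (a sketch with identifiers of my own, not the SDK's actual code); out-of-bounds taps are treated as zero padding:

```cpp
#include <vector>

// Naive reference convolution: for each output pixel, accumulate the
// mask-weighted neighborhood. Taps falling outside the image are skipped,
// i.e. treated as zero padding.
void referenceConvolve(const std::vector<float>& in, std::vector<float>& out,
                       int width, int height,
                       const std::vector<float>& mask, int maskDim)
{
    int half = maskDim / 2;
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float sum = 0.0f;
            for (int my = 0; my < maskDim; ++my) {
                for (int mx = 0; mx < maskDim; ++mx) {
                    int sy = y + my - half;
                    int sx = x + mx - half;
                    if (sy >= 0 && sy < height && sx >= 0 && sx < width)
                        sum += in[sy * width + sx] * mask[my * maskDim + mx];
                }
            }
            out[y * width + x] = sum;
        }
    }
}
```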
I did make a couple of modifications to the sample:
- Increased the mask size to 5×5 (a bit more realistic if doing something like a Gaussian blur with a sigma of 1.5)
- Made the input and output data floats rather than uints (who the f— in scientific computing cares about uints?)
- Modified the comparison against the reference CPU implementation to allow an epsilon tolerance, since FP op precision differences between the two paths make exact equality too strict.
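A sketch of the first and third modifications, with identifiers of my own invention (the SDK's actual code differs): building a normalized 5×5 Gaussian mask with sigma = 1.5, and an epsilon-tolerant compare against the reference output.

```cpp
#include <cmath>

// Build a normalized 5x5 Gaussian mask: w(x,y) = exp(-(x^2+y^2)/(2*sigma^2)),
// then divide by the total so the weights sum to 1.
void gaussianMask5x5(float mask[25], float sigma)
{
    float sum = 0.0f;
    for (int y = -2; y <= 2; ++y)
        for (int x = -2; x <= 2; ++x) {
            float w = std::exp(-(x * x + y * y) / (2.0f * sigma * sigma));
            mask[(y + 2) * 5 + (x + 2)] = w;
            sum += w;
        }
    for (int i = 0; i < 25; ++i)
        mask[i] /= sum;
}

// The CL compiler is free to reorder FP ops, so exact equality against the
// CPU reference is too strict; compare within a small tolerance instead.
bool nearlyEqual(float a, float b, float eps)
{
    float diff  = std::fabs(a - b);
    float scale = std::max(std::fabs(a), std::fabs(b));
    return diff <= eps * std::max(1.0f, scale);
}
```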
The testbench ran on my Intel Core 2 Duo T7600 laptop (2.33GHz, 3.2GB memory), Ubuntu 8.04 32-bit Desktop version. I ran a number of iterations with varying local workgroup sizes (i.e. the number of work-items per workgroup, or threadsPerBlock in CUDA-speak) to see how performance scales with workgroup size and compares against the CPU reference implementation for an input matrix of 4096×4096.
Here’s a breakdown of results (the x-axis is workgroup size, y-axis is seconds):
I was, naively, expecting that the number of work-items would map to the number of persistently active threads spawned by CL, which would mean peak performance at a workgroup size of 2 on a dual-core machine. But it doesn’t seem like CL ever spawns more than 4 threads. It seems more likely that a thread is spawned, made responsible for processing a whole workgroup, then killed/joined after it finishes its work.
I also ran some “Null Kernel” cases where I commented out the entire body of the SimpleConvolution CL kernel. The “Null Kernel” results suggest that AMD’s OpenCL creates, executes, then joins a thread per workgroup, or spends a lot of cycles on workgroup management – or both.
This may be why performance actually keeps improving as you increase the workgroup size – right up until it failed to execute at a size of 2048. At sufficiently large workgroup sizes (>= 256), the workgroup management time does get amortized over the number of work-items processed.
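To make the amortization argument concrete, here’s a toy cost model (the constants in the test are made up, purely illustrative): total time is per-workgroup overhead times the number of workgroups, plus a per-work-item cost. Growing the workgroup size shrinks only the first term, which is why large workgroups pull ahead.

```cpp
// Toy model of the hypothesized cost structure:
//   totalTime = numWorkgroups * perGroupOverhead + totalItems * perItemCost
// where numWorkgroups = totalItems / wgSize. Larger wgSize means fewer
// workgroups, so the fixed per-group management cost gets amortized.
double modelTime(double totalItems, double wgSize,
                 double perGroupOverhead, double perItemCost)
{
    double numGroups = totalItems / wgSize;
    return numGroups * perGroupOverhead + totalItems * perItemCost;
}
```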
Also interesting are the actual performance numbers compared to the CPU reference. OpenCL may scale with the number of cores, but its raw performance is good, not mind-shattering. Theoretically, with good use of SSE and some care for cache locality, you can improve single-threaded performance by 3-4x, putting a single-threaded CPU implementation neck and neck with the 256-512 CL results. Spawn two threads to work on separate parts of the matrix and you get another ~1.5x, pushing a hand-tuned CPU implementation comfortably ahead of the CL one.
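For a feel of what “good use of SSE” means here, a toy 4-wide multiply-accumulate (my own sketch, not the SDK's code; assumes an x86 target with SSE): the convolution's inner loop is essentially this kind of dot product, done on four floats per instruction instead of one.

```cpp
#include <xmmintrin.h>

// Dot product of two float arrays, n a multiple of 4: each iteration
// multiplies and accumulates four lanes at once with SSE, then the four
// partial sums are reduced at the end.
float dot4n(const float* a, const float* b, int n)
{
    __m128 acc = _mm_setzero_ps();
    for (int i = 0; i < n; i += 4)
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(a + i),
                                         _mm_loadu_ps(b + i)));
    float tmp[4];
    _mm_storeu_ps(tmp, acc);
    return tmp[0] + tmp[1] + tmp[2] + tmp[3];
}
```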
I’ll take AMD’s word for it that its CL driver will scale with cores. But it would be interesting if someone with a >2-core Intel system could confirm near-linear scaling. Or maybe I can get motivated enough to rerun my experiment with one of my laptop’s cores disabled.
So I conclude with the following:
- My timing mechanics could be entirely wrong and everything here could be thrown in the trash… then again, this is highly unlikely since I’m using exactly what AMD uses to measure event times for CL kernel execution and CPU perf times.
- Each workgroup spawns and then joins a new thread and/or incurs a workgroup management cost. So despite the AMD SDK using a workgroup size of 1 for most of its CL samples, you will get better performance with larger workgroup sizes.
- The AMD CL stack still has room for performance improvements, as the comparison against the CPU reference seems to indicate.
… or maybe AMD’s OpenCL driver is just sophisticatedly inefficient for Intel based systems.