Problem in choosing the best available GPU using OpenCL programmatically - C++11

I'm using the advice given here for choosing an optimal GPU for my algorithm.
https://stackoverflow.com/a/33488953/5371117
I query the devices on my MacBook Pro using boost::compute::system::devices(), which returns the following list of devices:
Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz
Intel(R) UHD Graphics 630
AMD Radeon Pro 560X Compute Engine
I want to use the AMD Radeon Pro 560X Compute Engine, but when I iterate over the devices to find the one with the maximum rating = CL_DEVICE_MAX_CLOCK_FREQUENCY * CL_DEVICE_MAX_COMPUTE_UNITS, I get the following results:
Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz,
freq: 2600, compute units: 12, rating:31200
Intel(R) UHD Graphics 630,
freq: 1150, units: 24, rating:27600
AMD Radeon Pro 560X Compute Engine,
freq: 300, units: 16, rating:4800
The AMD GPU has the lowest rating. I also looked into the specs, and it seems to me that CL_DEVICE_MAX_CLOCK_FREQUENCY isn't returning the correct value.
According to the AMD chip specs https://www.amd.com/en/products/graphics/radeon-rx-560x, my AMD GPU has a base frequency of 1175 MHz, not 300 MHz.
According to the Intel chip specs https://en.wikichip.org/wiki/intel/uhd_graphics/630, my Intel GPU has a base frequency of 300 MHz, not 1150 MHz, although it does have a boost frequency of 1150 MHz.
std::vector<boost::compute::device> devices = boost::compute::system::devices();
std::pair<boost::compute::device, ai::int64> suitableDevice{};
for(auto& device : devices)
{
    auto rating = device.clock_frequency() * device.compute_units();
    std::cout << device.name() << ", freq: " << device.clock_frequency()
              << ", units: " << device.compute_units() << ", rating: " << rating << std::endl;
    if(suitableDevice.second < rating)
    {
        suitableDevice.first = device;
        suitableDevice.second = rating;
    }
}
Am I doing anything wrong?

Those properties are unfortunately only really directly comparable within an implementation (same HW manufacturer, same OS).
My recommendation would be to:
First filter out anything with a device type other than CL_DEVICE_TYPE_GPU (unless there aren't any GPUs available, in which case you may want to fall back to CPU).
Check any other important device properties. For example, if you need support for a particular OpenCL version or extension, or if you need especially large work groups or local memory, check all remaining devices and filter out any that can't run your code.
Test whether any of the remaining devices return true for the CL_DEVICE_HOST_UNIFIED_MEMORY property. These will be integrated GPUs, and these are usually slower than discrete ones, unless you are bound by data transfer speeds, in which case they might be faster. So you'll want to prefer one type over the other.
If you're still left with more than one device after that, you can apply your existing heuristic; a sketch of this whole selection flow follows below.
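For illustration, here is a minimal sketch of that selection flow using boost::compute, to match the question's setup. It assumes boost::compute's device::type() and get_info<>() accessors behave as in the current Boost.Compute headers, that at least one OpenCL device is present, and it skips the "other important properties" step; treat it as a starting point rather than a drop-in implementation.
#include <boost/compute/system.hpp>
#include <boost/compute/device.hpp>
#include <vector>

boost::compute::device pick_device()
{
    namespace compute = boost::compute;
    std::vector<compute::device> devices = compute::system::devices();

    // 1) Prefer GPUs; fall back to the full list if there are none.
    std::vector<compute::device> gpus;
    for(const auto& d : devices)
        if(d.type() & CL_DEVICE_TYPE_GPU)
            gpus.push_back(d);
    const std::vector<compute::device>& candidates = gpus.empty() ? devices : gpus;

    // 2) Prefer discrete GPUs (no host-unified memory) over integrated ones.
    std::vector<compute::device> discrete;
    for(const auto& d : candidates)
        if(!d.get_info<cl_bool>(CL_DEVICE_HOST_UNIFIED_MEMORY))
            discrete.push_back(d);
    const std::vector<compute::device>& pool = discrete.empty() ? candidates : discrete;

    // 3) Apply the clock * compute-units heuristic only within the remaining pool.
    compute::device best = pool.front();
    cl_uint best_rating = 0;
    for(const auto& d : pool)
    {
        const cl_uint rating = d.clock_frequency() * d.compute_units();
        if(rating > best_rating) { best_rating = rating; best = d; }
    }
    return best;
}
On the machine from the question this would skip the CPU, then prefer the discrete Radeon Pro over the integrated UHD 630 before the clock-frequency heuristic is ever consulted.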

This call returns the device with the most floating-point performance:
select_device_with_most_flops(find_devices());
and this one returns the device with the most memory:
select_device_with_most_memory(find_devices());
First, find_devices() returns a vector of all OpenCL devices in the system. select_device_with_most_memory() is straightforward and uses getInfo<CL_DEVICE_GLOBAL_MEM_SIZE>().
Floating-point performance is given by this equation:
FLOPs/s = cores/CU * CUs * IPC * clock frequency
select_device_with_most_flops() is more difficult, because OpenCL only provides the number of compute units (CUs) via getInfo<CL_DEVICE_MAX_COMPUTE_UNITS>(). For a CPU, this is the number of threads; for a GPU, it has to be multiplied by the number of stream processors / CUDA cores per CU, which differs between Nvidia, AMD and Intel as well as between their microarchitectures, and is usually between 4 and 128. Luckily, the vendor is included in getInfo<CL_DEVICE_VENDOR>(), so based on the vendor and the number of CUs one can figure out the number of cores per CU.
The next part is the FP32 IPC, or instructions per cycle. For most GPUs this is 2, while for recent CPUs it is 32, see https://en.wikipedia.org/wiki/FLOPS?oldformat=true#FLOPs_per_cycle_for_various_processors
There is no way to figure out the IPC in OpenCL directly, so the 32 for CPUs is just a guess. One could use the device name and a lookup table to be more accurate. getInfo<CL_DEVICE_TYPE>()==CL_DEVICE_TYPE_GPU will result in true if the device is a GPU.
The final part is the clock frequency. OpenCL provides the base clock frequency in MHz via getInfo<CL_DEVICE_MAX_CLOCK_FREQUENCY>(). The device can boost to higher frequencies, so this again is an approximation.
All of it together gives an estimation for the floating-point performance. The full code is shown below:
typedef unsigned int uint;

string trim(const string s) { // removes whitespace characters from beginning and end of string s
    const int l = (int)s.length();
    int a=0, b=l-1;
    char c;
    while(a<l && ((c=s.at(a))==' '||c=='\t'||c=='\n'||c=='\v'||c=='\f'||c=='\r'||c=='\0')) a++;
    while(b>a && ((c=s.at(b))==' '||c=='\t'||c=='\n'||c=='\v'||c=='\f'||c=='\r'||c=='\0')) b--;
    return s.substr(a, 1+b-a);
}
bool contains(const string s, const string match) {
    return s.find(match)!=string::npos;
}
vector<Device> find_devices() {
    vector<Platform> platforms; // get all platforms (drivers)
    vector<Device> devices_available;
    vector<Device> devices; // get all devices of all platforms
    Platform::get(&platforms);
    if(platforms.size()==0) print_error("There are no OpenCL devices available. Make sure that the OpenCL 1.2 Runtime for your device is installed. For GPUs it comes by default with the graphics driver, for CPUs it has to be installed separately.");
    for(uint i=0; i<(uint)platforms.size(); i++) {
        devices_available.clear();
        platforms[i].getDevices(CL_DEVICE_TYPE_ALL, &devices_available); // CL_DEVICE_TYPE_CPU, CL_DEVICE_TYPE_GPU
        if(devices_available.size()==0) continue; // no device of type device_type found in platform i
        for(uint j=0; j<(uint)devices_available.size(); j++) devices.push_back(devices_available[j]);
    }
    print_device_list(devices);
    return devices;
}
Device select_device_with_most_flops(const vector<Device> devices) { // return device with best floating-point performance
    float best_value = 0.0f;
    uint best_i = 0; // index of fastest device
    for(uint i=0; i<(uint)devices.size(); i++) { // find device with highest (estimated) floating-point performance
        const Device d = devices[i];
        //const string device_name = trim(d.getInfo<CL_DEVICE_NAME>());
        const string device_vendor = trim(d.getInfo<CL_DEVICE_VENDOR>()); // is either Nvidia, AMD or Intel
        const uint device_compute_units = (uint)d.getInfo<CL_DEVICE_MAX_COMPUTE_UNITS>(); // compute units (CUs) can contain multiple cores depending on the microarchitecture
        const bool device_is_gpu = d.getInfo<CL_DEVICE_TYPE>()==CL_DEVICE_TYPE_GPU;
        const uint device_ipc = device_is_gpu?2u:32u; // IPC (instructions per cycle) is 2 for GPUs and 32 for most modern CPUs
        const uint nvidia = (uint)(contains(device_vendor, "NVIDIA")||contains(device_vendor, "vidia"))*(device_compute_units<=30u?128u:64u); // Nvidia GPUs usually have 128 cores/CU, except Volta/Turing (>30 CUs) which have 64 cores/CU
        const uint amd = (uint)(contains(device_vendor, "AMD")||contains(device_vendor, "ADVANCED")||contains(device_vendor, "dvanced"))*(device_is_gpu?64u:1u); // AMD GCN GPUs usually have 64 cores/CU, AMD CPUs have 1 core/CU
        const uint intel = (uint)(contains(device_vendor, "INTEL")||contains(device_vendor, "ntel"))*(device_is_gpu?8u:1u); // Intel integrated GPUs usually have 8 cores/CU, Intel CPUs have 1 core/CU
        const uint device_cores = device_compute_units*(nvidia+amd+intel);
        const uint device_clock_frequency = (uint)d.getInfo<CL_DEVICE_MAX_CLOCK_FREQUENCY>(); // in MHz
        const float device_tflops = 1E-6f*(float)device_cores*(float)device_ipc*(float)device_clock_frequency; // estimated device floating-point performance in TeraFLOPs/s
        if(device_tflops>best_value) {
            best_value = device_tflops;
            best_i = i; // remember index of fastest device
        }
    }
    return devices[best_i];
}
Device select_device_with_most_memory(const vector<Device> devices) { // return device with largest memory capacity
    float best_value = 0.0f;
    uint best_i = 0; // index of device with most memory
    for(uint i=0; i<(uint)devices.size(); i++) { // find device with largest memory capacity
        const Device d = devices[i];
        const float device_memory = 1E-3f*(float)(d.getInfo<CL_DEVICE_GLOBAL_MEM_SIZE>()/1048576ull); // in GB
        if(device_memory>best_value) {
            best_value = device_memory;
            best_i = i; // remember index of device with most memory
        }
    }
    return devices[best_i];
}
Device select_device_with_id(const vector<Device> devices, const int id) { // return device with the specified ID
    if(id>=0&&id<(int)devices.size()) {
        return devices[id];
    } else {
        print("Your selected device ID ("+to_string(id)+") is wrong.");
        return devices[0]; // is never executed, just to avoid compiler warnings
    }
}
UPDATE: I have now included an improved version of this in a lightweight OpenCL-Wrapper. This correctly calculates the FLOPs for all CPUs and GPUs from the last decade or so: https://github.com/ProjectPhysX/OpenCL-Wrapper

Related

How to leverage multichannel RAM?

My objective
I want to write a RAM bandwidth benchmark.
My solution
U64 benchRead(U64 &io_minLoopTicks) // io_minLoopTicks initial value is U64_MAX
{
    const U32 elementsCount = 100'000'000;
    U64 accumulator = 0;
    std::vector<U64> tab;
    tab.resize(elementsCount); // 800 MB
    for(U8 i = 0; i < 10; ++i) // Loop to get a reliable value.
    {
        const U64 startTimestamp = profiler.start(); // profiler uses QueryPerformanceCounter under the hood
        for(const U64 j : tab)
            accumulator += j; // Do something with the RAM so the reads are not optimized away.
        const U64 loopTicks = profiler.end() - startTimestamp;
        if(loopTicks < io_minLoopTicks)
            io_minLoopTicks = loopTicks;
    }
    return accumulator;
}
My problem
The theoretical bandwidth would be 57.6 GB/s (8 bytes (bus width) * 2 (channels) * 3.6 GT/s).
In practice I get 27.7 GB/s (9700K @ 4.8 GHz, 4 single-rank 3600 MT/s DDR4 DIMMs).
So my number seems to reflect the performance of only one channel.
My guess is that, the way I coded it, the array sits in an area of RAM that can only be accessed by one channel. I am not familiar with multi-channel at all, so I don't really understand the limits of that technology.
My RAM is definitely dual channel (CPU-Z confirms it).
Performance analysis
I checked the output of Clang 14 and it uses SSE (-O3 with no -march=native). I could be wrong (I'm not an assembly person), but it seems it unrolled the loop and processes 128 bytes per iteration. One iteration of the unrolled loop takes 4.5 cycles on Coffee Lake according to uiCA, which would correspond to 139.38 GB/s, way above the theoretical RAM speed. So we should be RAM-limited.
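One experiment that helps isolate this (a sketch of my own, not something from the original post, assuming the usual situation where a single core cannot keep enough memory requests in flight to saturate both channels): split the summation across several threads and compare the aggregate bandwidth with the single-threaded number.
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <thread>
#include <vector>

// Sums 'tab' with 'threadCount' threads. If two or more threads reach a clearly
// higher aggregate bandwidth than one thread, the single core (not the channel
// configuration) was the bottleneck.
uint64_t benchReadThreaded(const std::vector<uint64_t>& tab, unsigned threadCount)
{
    std::vector<uint64_t> partial(threadCount, 0);
    std::vector<std::thread> workers;
    const std::size_t chunk = tab.size() / threadCount;
    for(unsigned t = 0; t < threadCount; ++t)
    {
        workers.emplace_back([&, t] {
            const std::size_t begin = t * chunk;
            const std::size_t end = (t + 1 == threadCount) ? tab.size() : begin + chunk;
            uint64_t acc = 0;
            for(std::size_t i = begin; i < end; ++i)
                acc += tab[i]; // keep the reads observable
            partial[t] = acc;
        });
    }
    for(auto& w : workers) w.join();
    return std::accumulate(partial.begin(), partial.end(), uint64_t{0});
}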
My questions
Is my array stored in one bank, and is that why it can only be accessed by one channel?
How do we write code to leverage the benefits of multichannel RAM (here dual channel)?
References
Godbolt

Perf impact of sampling rate on performance - higher sample rates cost *less* overhead on NXP S32?

I am using perf in sampling mode to capture performance statistics of programs running on a multi-core NXP S32 platform running Linux 4.19.
Example configuration:
Core 0 - App0, Core 1 - App1, Core 2 - App2
Without sampling, i.e. measured at program level, App0 takes 6.9 seconds.
Sampling every 1 million cycles, App0 takes 6.3 sec.
Sampling every 2 million cycles, App0 takes 6.4 sec.
Sampling every 5 million cycles, App0 takes 6.5 sec.
Sampling every 100 million cycles, App0 takes 6.8 sec.
As you can see, with a larger sampling period (100 million cycles) App0 takes longer to finish.
I would have expected the opposite: sampling every 1 million cycles should result in the program taking more time to execute, due to the larger number of samples generated (perf overhead), compared to every 100 million cycles.
I am unable to explain this behavior. What do you think is causing it?
Any leads would be helpful.
P.S. - On the Pi 3B the behavior is as expected, i.e. sampling every 1 million cycles results in a longer execution time than every 100 million cycles.
UPDATE: I do not use perf from the command line; instead I call perf_event_open directly, with the following flags set in struct perf_event_attr.
struct perf_event_attr hw_event;
pid_t pid = proccess_id;   // measure the current process/thread
int cpu = -1;              // measure on any CPU
unsigned long flags = 0;
int fd_current;

memset(&hw_event, 0, sizeof(struct perf_event_attr));
hw_event.type = event_type;
hw_event.size = sizeof(struct perf_event_attr);
hw_event.config = event;
if(group_fd == -1)
{
    hw_event.sample_period = 2000000;
    hw_event.sample_type = PERF_SAMPLE_READ;
    hw_event.precise_ip = 1;
}
hw_event.disabled = 1;                    // counter starts out disabled
hw_event.exclude_kernel = 0;              // do not exclude events that happen in kernel space
hw_event.exclude_hv = 1;                  // exclude events that happen in the hypervisor
hw_event.pinned = pinned;                 // keep the counter on the CPU if at all possible; applies only to hardware counters and only to group leaders
hw_event.exclude_user = 0;                // do not exclude events that happen in user space
hw_event.exclude_callchain_kernel = 0;    // include kernel callchains
hw_event.exclude_callchain_user = 0;      // include user callchains
hw_event.read_format = PERF_FORMAT_GROUP; // allow all counter values in an event group to be read with one read()

fd_current = syscall(__NR_perf_event_open, &hw_event, pid, cpu, group_fd, flags);
if (fd_current == -1) {
    printf("Error opening leader %llx\n", hw_event.config);
    exit(EXIT_FAILURE);
}
return fd_current;
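For context, this is roughly how such a counter group is usually started, stopped and read back (a generic perf_event_open usage sketch based on the standard ioctl interface, not taken from the poster's code; the fd is assumed to be the group leader returned above):
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>

static void run_measured_section(int fd)
{
    ioctl(fd, PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP);   // zero all counters in the group
    ioctl(fd, PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP);  // start counting

    // ... workload under measurement ...

    ioctl(fd, PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP); // stop counting

    // With read_format == PERF_FORMAT_GROUP, read() returns: nr, then one value per event.
    uint64_t buf[16];
    if (read(fd, buf, sizeof(buf)) > 0)
        printf("events in group: %llu, first value: %llu\n",
               (unsigned long long)buf[0], (unsigned long long)buf[1]);
}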

Allocate maximum memory available

I am currently trying to write a program which is supposed to allocate the maximum memory available. I came up with a solution that narrows down the range of potentially available memory until both bounds are equal (see listing):
void enforceMemoryLeakage(void** arrayOfAllocMemory)
{
    unsigned int maxMemory = 0x80000000;
    unsigned int minMemory = 0x50000000;
    unsigned int attempAllocatedMemory = minMemory + (maxMemory - minMemory) / 2;
    void* pAllocMemory;

    while((maxMemory - minMemory) > 1)
    {
        pAllocMemory = malloc(attempAllocatedMemory);
        if (pAllocMemory != NULL)
        {
            minMemory = attempAllocatedMemory;
            attempAllocatedMemory += (maxMemory - minMemory) / 2;
            free(pAllocMemory);
        }
        else
        {
            maxMemory = attempAllocatedMemory;
            attempAllocatedMemory = minMemory + (maxMemory - minMemory) / 2;
        }
    }
    arrayOfAllocMemory[0] = malloc(maxMemory);

    void* pAllocAdditionalMemory = malloc(100);
    if (pAllocAdditionalMemory == NULL)
        std::cout << "Maximum memory: " << minMemory << "\n";
}
The code displayed above works fine. However, once the following statements are executed,
void* pAllocAdditionalMemory = malloc(100);
if (pAllocAdditionalMemory == NULL)
    std::cout << "Maximum memory: " << minMemory << "\n";
I would have expected that no further memory is available, yet the additional allocation still succeeds. This brings me to my actual question: why does the approach shown above not work?
Best regards
Ratbald
You did not specify the OS or platform, so I am completely guessing here; read with extreme prejudice...
Assuming you do not have a bug in your binary search code, my bet is that you are facing memory fragmentation. While you successfully allocate and free memory during execution, other processes can do the same, so your memory can become fragmented. Example:
The OS has a 2 MByte chunk of contiguous free memory.
You allocate 1.5 MByte (0.5 MByte free).
Some other process allocates 1 KByte (0.499 MByte free).
You free your 1.5 MByte (1.5 + 0.499 MByte free, in two separate fragments).
You attempt to allocate 1.75 MByte, but the OS has no single contiguous chunk of that size, so the attempt fails and your search settles on 1.5 MByte again (0.499 MByte free).
You then successfully allocate 1 KByte (0.498 MByte free).
So even after you have grabbed the largest single block, small allocations like your malloc(100) can still succeed from the remaining fragments.
Depending on the OS memory management strategy, you sometimes do not even need another process interfering to fragment memory...
There are, however, other possibilities not related to fragmentation. In the case of emulation or WOW64, the OS will not hand out the whole available RAM, and there are also limits on the size of a single contiguous chunk. For example, Win32 will not allow more than ~1.25 GByte in a single chunk, but that does not mean there is only 1.25 GByte of free RAM...
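If the goal is really to grab as much memory as possible rather than to find the largest single block, a common workaround (a sketch of the general idea, not something from the original answer) is to keep allocating in progressively smaller chunks until even small requests fail, collecting every successful allocation:
#include <cstddef>
#include <cstdlib>
#include <vector>

// Grabs as much heap memory as it can, in progressively smaller chunks,
// and returns the pointers so they can later be freed.
std::vector<void*> allocateAllAvailableMemory(std::size_t startChunk = 0x40000000 /* 1 GiB */,
                                              std::size_t minChunk = 4096)
{
    std::vector<void*> blocks;
    std::size_t chunk = startChunk;
    while (chunk >= minChunk)
    {
        void* p = std::malloc(chunk);
        if (p != nullptr)
            blocks.push_back(p);   // keep it and try the same size again
        else
            chunk /= 2;            // this size no longer fits anywhere, try smaller chunks
    }
    return blocks;
}
This way fragmentation only costs you the leftover scraps smaller than minChunk, instead of everything outside the single largest block. Be aware that on systems with memory overcommit this kind of exhaustive allocation may trigger swapping or the OOM killer rather than a clean malloc failure.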

Why is `_mm_stream_si128` much slower than `_mm_storeu_si128` on Skylake-Xeon when writing parts of 2 cache lines? But less effect on Haswell

I have code that looks like this (simple load, modify, store) (I've simplified it to make it more readable):
__asm__ __volatile__ ( "vzeroupper" : : : );
while(...) {
    __m128i in = _mm_loadu_si128(inptr);
    __m128i out = in; // real code does more than this, but I've simplified it
    _mm_stream_si128(outptr, out);
    inptr += 12;
    outptr += 16;
}
This code runs about 5 times faster on our older Haswell hardware compared to our newer Skylake machines. For example, if the while loop runs about 16e9 iterations, it takes 14 seconds on Haswell and 70 seconds on Skylake.
We upgraded to the latest microcode on the Skylake machines, and also stuck in vzeroupper instructions to avoid any AVX issues. Neither fix had any effect.
outptr is aligned to 16 bytes, so the stream instruction should be writing to aligned addresses (I put in checks to verify this). inptr is not aligned, by design. Commenting out the loads has no effect; the limiting instructions are the stores. outptr and inptr point to different memory regions; there is no overlap.
If I replace the _mm_stream_si128 with _mm_storeu_si128, the code runs way faster on both machines, about 2.9 seconds.
So the two questions are
1) why is there such a big difference between Haswell and Skylake when writing using the _mm_stream_si128 intrinsic?
2) why does the _mm_storeu_si128 run 5x faster than the streaming equivalent?
I'm a newbie when it comes to intrinsics.
Addendum - test case
Here is the entire test case: https://godbolt.org/z/toM2lB
Here is a summary of the benchmarks I took on two different processors, an E5-2680 v3 (Haswell) and an 8180 (Skylake).
// icpc -std=c++14 -msse4.2 -O3 -DNDEBUG ../mre.cpp -o mre
// The following benchmark times were observed on a Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz
// and Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz.
// The command line was
// perf stat ./mre 100000
//
// STORER time (seconds)
// E5-2680 8180
// ---------------------------------------------------
// _mm_stream_si128 1.65 7.29
// _mm_storeu_si128 0.41 0.40
The ratio of stream to store is 4x or 18x, respectively.
I'm relying on the default new allocator to align my data to 16 bytes. I'm getting lucky here that it is aligned. I have tested that this is true, and in my production application I use an aligned allocator to make absolutely sure it is, as well as checks on the address, but I left that out of the example because I don't think it matters.
Second edit - 64B aligned output
The comment from @Mystical made me check that the outputs were all cache-line aligned. The writes to the Tile structures are done in 64-B chunks, but the Tiles themselves were not 64-B aligned (only 16-B aligned).
So I changed my test code like this:
#if 0
std::vector<Tile> tiles(outputPixels/32);
#else
std::vector<Tile, boost::alignment::aligned_allocator<Tile,64>> tiles(outputPixels/32);
#endif
and now the numbers are quite different:
// STORER time (seconds)
// E5-2680 8180
// ---------------------------------------------------
// _mm_stream_si128 0.19 0.48
// _mm_storeu_si128 0.25 0.52
So everything is much faster. But the Skylake is still slower than Haswell by a factor of 2.
Third edit - purposeful misalignment
I tried the test suggested by @HadiBrais. I purposely allocated my vector class aligned to 64 bytes, then added 16 or 32 bytes inside the allocator such that the allocation was either 16-byte or 32-byte aligned, but NOT 64-byte aligned. I also increased the number of loops to 1,000,000, ran the test 3 times, and picked the smallest time.
perf stat ./mre1 1000000
To reiterate: in the table below, an alignment of 2^N means the data is NOT also aligned to 2^(N+1) or 2^(N+2).
// STORER alignment time (seconds)
// byte E5-2680 8180
// ---------------------------------------------------
// _mm_storeu_si128 16 3.15 2.69
// _mm_storeu_si128 32 3.16 2.60
// _mm_storeu_si128 64 1.72 1.71
// _mm_stream_si128 16 14.31 72.14
// _mm_stream_si128 32 14.44 72.09
// _mm_stream_si128 64 1.43 3.38
So it is clear that cache alignment gives the best results, but _mm_stream_si128 is better only on the 2680 processor and suffers some sort of penalty on the 8180 that I can't explain.
For future use, here is the misaligned allocator I used (I did not templatize the misalignment; you'll have to edit the 32 and change it to 0 or 16 as needed):
template <class T>
struct Mallocator {
    typedef T value_type;
    Mallocator() = default;
    template <class U> constexpr Mallocator(const Mallocator<U>&) noexcept {}
    T* allocate(std::size_t n) {
        if(n > std::size_t(-1) / sizeof(T)) throw std::bad_alloc();
        uint8_t* p1 = static_cast<uint8_t*>(aligned_alloc(64, (n+1)*sizeof(T)));
        if(!p1) throw std::bad_alloc();
        p1 += 32; // misalign on purpose
        return reinterpret_cast<T*>(p1);
    }
    void deallocate(T* p, std::size_t) noexcept {
        uint8_t* p1 = reinterpret_cast<uint8_t*>(p);
        p1 -= 32;
        std::free(p1);
    }
};
template <class T, class U>
bool operator==(const Mallocator<T>&, const Mallocator<U>&) { return true; }
template <class T, class U>
bool operator!=(const Mallocator<T>&, const Mallocator<U>&) { return false; }
...
std::vector<Tile, Mallocator<Tile>> tiles(outputPixels/32);
The simplified code doesn't really show the actual structure of your benchmark. I don't think the simplified code will exhibit the slowness you've mentioned.
The actual loop from your godbolt code is:
while (count > 0)
{
    // std::cout << std::hex << (void*) ptr << " " << (void*) tile << std::endl;
    __m128i value0 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ptr + 0 * diffBytes));
    __m128i value1 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ptr + 1 * diffBytes));
    __m128i value2 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ptr + 2 * diffBytes));
    __m128i value3 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ptr + 3 * diffBytes));

    __m128i tileVal0 = value0;
    __m128i tileVal1 = value1;
    __m128i tileVal2 = value2;
    __m128i tileVal3 = value3;

    STORER(reinterpret_cast<__m128i*>(tile + ipixel + diffPixels * 0), tileVal0);
    STORER(reinterpret_cast<__m128i*>(tile + ipixel + diffPixels * 1), tileVal1);
    STORER(reinterpret_cast<__m128i*>(tile + ipixel + diffPixels * 2), tileVal2);
    STORER(reinterpret_cast<__m128i*>(tile + ipixel + diffPixels * 3), tileVal3);

    ptr += diffBytes * 4;
    count -= diffBytes * 4;

    tile += diffPixels * 4;
    ipixel += diffPixels * 4;
    if (ipixel == 32)
    {
        // go to next tile
        ipixel = 0;
        tileIter++;
        tile = reinterpret_cast<uint16_t*>(tileIter->pixels);
    }
}
Note the if (ipixel == 32) part. This jumps to a different tile every time ipixel reaches 32. Since diffPixels is 8, this happens every iteration. Hence you are only making 4 streaming stores (64 bytes) per tile. Unless each tile happens to be 64-byte aligned, which is unlikely to happen by chance and cannot be relied on, those 64 bytes straddle two cache lines, and each line is only partially written. That's a known anti-pattern for streaming stores: for effective use of streaming stores you need to write out the full line.
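To illustrate the full-line pattern (a schematic sketch, not the poster's actual code): with a 64-byte-aligned destination, four consecutive 16-byte NT stores combine into one complete cache-line write, which is what non-temporal stores are designed for.
#include <emmintrin.h>  // _mm_loadu_si128, _mm_stream_si128
#include <stdint.h>

// Copies 64 bytes with non-temporal stores. 'dst' must be 64-byte aligned so
// that the four 16-byte stores fill exactly one cache line; the write-combining
// buffer can then flush the whole line to memory in a single transaction.
static inline void stream_full_line(const uint8_t* src, uint8_t* dst)
{
    __m128i v0 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + 0));
    __m128i v1 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + 16));
    __m128i v2 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + 32));
    __m128i v3 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + 48));
    _mm_stream_si128(reinterpret_cast<__m128i*>(dst + 0),  v0);
    _mm_stream_si128(reinterpret_cast<__m128i*>(dst + 16), v1);
    _mm_stream_si128(reinterpret_cast<__m128i*>(dst + 32), v2);
    _mm_stream_si128(reinterpret_cast<__m128i*>(dst + 48), v3);
}
// After the copy loop, an _mm_sfence() should be issued before other threads read the data.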
On to the performance differences: streaming stores have widely varying performance on different hardware. These stores always occupy a line fill buffer for some time, but how long varies: on lots of client chips it seems to only occupy a buffer for about the L3 latency. I.e., once the streaming store reaches the L3 it can be handed off (the L3 will track the rest of the work) and the LFB can be freed on the core. Server chips often have much longer latency. Especially multi-socket hosts.
Evidently, the performance of NT stores is worse on the SKX box, and much worse for partial line writes. The overall worse performance is probably related to the redesign of the L3 cache.

OpenCL very low GFLOPS, no data transfer bottleneck

I am trying to optimize an algorithm I am running on my GPU (AMD HD 6850). I counted the number of floating-point operations inside my kernel and measured its execution time. I found it to achieve ~20 SP GFLOPS; however, according to the GPU's specs, I should achieve ~1500 GFLOPS.
To find the bottleneck I created a very simple kernel:
kernel void test_gflops(const float d, global float* result)
{
    int gid = get_global_id(0);
    float cd;
    for (int i=0; i<100000; i++)
    {
        cd = d*i;
    }
    if (cd == -1.0f)
    {
        result[gid] = cd;
    }
}
Running this kernel I get ~5*10^5 work_items/sec. I count one floating point operation (not sure if that's right, what about incrementing i and comparing it to 100000?) per iteration of the loop.
==> 5*10^5 work_items/sec * 10^5 FLOP/work_item = 50 GFLOPS.
Even if there are 3 or 4 operations going on in the loop, it's much slower than what the card should be able to do. What am I doing wrong?
The global work size is big enough (no speed change for 10k vs 100k work items).
Here are a couple of tricks:
GPUs don't like loops at all. Use #pragma unroll to unroll them.
Your GPU is good at vector operations. Stick to them; that will let you process multiple operands at once.
Use vector load/store wherever possible.
Measure the memory bandwidth - I'm almost sure that you are bandwidth-limited because of a poor access pattern.
In my opinion, the kernel should look like this:
typedef union floats{
    float16 vector;
    float array[16];
} floats;

kernel void test_gflops(const float d, global float* result)
{
    int gid = get_global_id(0);
    floats cd;
    cd.vector = vload16(gid, result);
    cd.vector *= d;
    #pragma unroll
    for (int i=0; i<16; i++)
    {
        if(cd.array[i] == -1.0f){
            result[gid] = cd.array[i];
        }
    }
}
Make your NDRange bigger to compensate for the difference between the 16 elements processed per work-item here and the much larger loop count in your original kernel; a host-side sketch follows below.
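For example (a host-side sketch using the plain OpenCL C API; the function and parameter names are made up for illustration and the queue, kernel and buffer are assumed to have been created with the usual clCreateCommandQueue / clCreateKernel / clCreateBuffer calls), each work-item now handles 16 floats, so you pick the total element count large enough to keep all compute units busy and launch n/16 work-items:
#include <CL/cl.h>

// Enqueues the vectorized kernel over 'n' floats, 16 per work-item.
// 'n' is assumed to be a multiple of 16.
cl_int launch_test_gflops(cl_command_queue queue, cl_kernel kern,
                          cl_mem resultBuffer, cl_float d, size_t n)
{
    size_t globalWorkSize = n / 16;  // one work-item per float16
    cl_int err = clSetKernelArg(kern, 0, sizeof(cl_float), &d);
    err |= clSetKernelArg(kern, 1, sizeof(cl_mem), &resultBuffer);
    err |= clEnqueueNDRangeKernel(queue, kern, 1, NULL,
                                  &globalWorkSize, NULL, // let the runtime pick the local size
                                  0, NULL, NULL);
    return err;
}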