Portability and Optimization of OpenCL between Radeon Graphic Cards - parallel-processing

I'm planning on diving into OpenCL and have been reading (only surface knowledge) on what OpenCL can do, but have a few questions.
Let's say I have an AMD Radeon 7750 and I have another computer that has an AMD Radeon 5870 and no plans on using a computer with an Nvidia card. I heard that optimizing the code for a particular device brings performance benefits. What exactly does optimizing mean? From what I've read and a little bit of guessing, it sounds like it means writing the code in a way that a GPU likes (in general without concern that it's an AMD or Nvidia card) as well as in a way that matches how the graphics card handles memory (I'm guessing this is compute device specific? Or is this only brand specific?).
So if I write code and optimized it for the Radeon 7750, would I be able to bring that code to the other computer with the Radeon 5870 and, without changing any part of the code, still retain a reasonable amount of performance benefits from the optimization? In the event that the code doesn't work, would changing parts of the code be a minor issue or would it involve rewriting enough code that it would have been a better idea to have written an optimized code for the Radeon 5870 in the first place.

Without more information about the algorithms and applications you intend to write, the question is a little vague. But I think I can give you some high-level strategies to keep in mind as you develop your code for these two different platforms.
The Radeon 7750's design is of the new Graphics Core Next architecture, while your HD5780 is based on the older VLIW5 (RV770) Architecture.
For your code to perform well on the HD5780 hardware you must make as heavy use of the packed primitive datatypes as possible, especially the int4, float4 types. This is because the OpenCL compiler has a difficult time automatically discovering parallelism and packing data into the wide vectors for you. If you can structure your code so that you already have taken this into account, then you will be able to fill more of the VLIW-5 slots and thus use more of your stream processors.
GCN is more like NVidia's Fermi architecture, where the code's path to the functional units (ALUs, etc.) of the stream processors does not go through explicitly scheduled VLIW instructions. So more parallelism can be automatically detected at runtime and keep your functional units busy doing useful work without you having to think as hard about how to make that happen.
Here's an over-simplified example to illustrate my point:
// multiply four factors
// A[0] = B[0] * C[0]
// ...
// A[3] = B[3] * C[3];
float *A, *B, *C;
for (i = 0; i < 4; i ++) {
A[i] = B[i] * C[i];
}
That code will probably run ok on a GCN architecture (except for suboptimal memory access performance--an advanced topic). But on your HD5870 it would be a disaster, because those four multiplies would take up 4 VLIW5 instructions instead of 1! So you would write that above code using the float4 type:
float4 A, B, C;
A = B * C;
And it would run really well on both of your cards. Plus it would kick ass on a CPU OpenCL context and make great use of MMX/SSE wide registers, a bonus. It's also a much better use of the memory system.
An a tiny nutshell, using the packed primitives is the one thing I can recommend to keep in mind as you start deploying code on these two systems at the same time.
Here's one more example that more clearly illustrates what you need to be careful doing on your HD5870. Say we had implemented the previous example using separate work-units:
// multiply four factors
// as separate work units
// A = B * C
float A, B, C;
A = B * C;
And we had four separate work units instead of one. That would be an absolute disaster on the VLIW device and would show tremendously better performance on the GCN device. That is something you will want to also look for when you are writing your code--can you use float4 types to reduce the number of work units doing the same work? If so, then you will see good performance on both platforms.

Related

Is a reduction or atomic operation on mat/vec types with OpenGL compute shader possible?

Is it possible to do reduction/update or atomic operations in the computer shader on e.g. mat3, vec3 data types?
Like this scheme:
some_type mat3 A;
void main() {
A += mat3(1);
}
I have tried out to use shader storage buffer objects (SSBO) but it seems like the update is not atomic (at least I get wrong results when I read back the buffer).
Does anyone have an idea to realize this? Maybe creating a tiny 3x3 image2D and store the result by imageAtomicAdd in there?
There are buffer-based atomics in GLES 3.1.
https://www.khronos.org/registry/gles/specs/3.1/es_spec_3.1.pdf
Section 7.7.
Maybe creating a tiny 3x3 image2D and store the result by imageAtomicAdd in there?
Image atomics are not core and require an extension.
Thank you for the links. I forgot to mention that I work with ARM Mali GPUs and as such they do not expose TLP and do not have warps/wave fronts as Nvidia or AMD. That is, I might have to figure out another quick way.
The techniques proposed in the comments for your post (in particular the log(N) divisor approach where you fold the top half of the results down) still work fine on Mali. The technique doesn't rely on warps/wavefronts - as the original poster said, you just need synchronization (e.g. use a barrier() rather than relying on the implicit barrier which wavefronts would give you).

Finding the effective switched capacitance of a processor

I need to determine the power consumption of a processor using the equation,
P = C*(V^2)*f
where C is the effective switched capacitance, V is the supply voltage and f is the processor frequency. Could someone please explain me where I could find sample C and V values for a typical processor?. I have gone through some of the Intel processor data sheets, but haven't been able to figure out the typical C values of a processor.
The following link has a generic explanation of the CPU power consumption. However, in order to perform this power calculation, I need to know CL values as specified in this answer.
Could someone please provide me some tips or useful links where I could get the values for an Intel processor?.
I suspect you can't use that power equation in any useful way to measure the power of a modern Intel processor. The voltage requirements should be constant due to modern design needs. You can pull the actual power requirements from the design specs -- again, the designers of the motherboard and power supplies need to know this. This implies that the capacitance of the processor varies moment by moment. Though this is possible, it also may mean that this model of the internals of the processor is wrong.
I'm not saying that the venerated P=C*V^2*f is wrong but that it probably doesn't apply. It gives you the power of a (CMOS?) transistor. Though you can scale it up by multiplying by the number of transistors in a processor (near 5.5B), I'm sure this is way off the mark. The power of a modern Intel processor can vary by at least a couple of orders of magnitude from moment to moment.
As an interesting set of exercises, (1) take the max and min power requirements from the spec, and compute the required C (linear with regards to P in the equation); and (2) get a spec for one switching device using the technology of a modern Intel processor, and multiply it by the # of such devices (>5.5B) to get the "required" power.

Alternative for dynamic parallelism for CUDA

I am very new to the CUDA programming model and programming in general, I suppose. I'm attempting to parallelize an expectation maximization algorithm. I am working on a gtx 480 which has compute capability 2.0. At first, I sort of assumed that there's no reason for the device to launch its own threads, but of course, I was sadly mistaken. I came across this pdf.
http://docs.nvidia.com/cuda/pdf/CUDA_Dynamic_Parallelism_Programming_Guide.pdf
Unfortunately, dynamic parallelism only works on the latest and greatest GPUs, with compute capability 3.5. Without diving into too much specifics, what is the alternative to dynamic parallelism? The loops in the CPU EM algorithm have many dependencies and are highly nested, which seems to make dynamic parallelism an attractive ability. I'm not sure if my question makes sense so please ask if you need clarification.
Thank you!
As indicated by #JackOLantern, dynamic parallelism can be described in a nutshell as the ability to call a kernel (i.e. a __global__ function) from device code (a __global__ or __device__ function).
Since the kernel call is the principal method by which the machine spins up multiple threads in response to a single function call, there is really no direct alternative that provides all the capability of dynamic parallelism in a device that does not support it (ie. pre cc 3.5 devices).
Without dynamic parallelism, your overall code will almost certainly involve more synchronization and communication between CPU code and GPU code.
The principal method would be to realize some unit of your code as parallelizable, convert it to a kernel, and work through your code in essentially a non-nested fashion. Repetetive functions might be done via looping in the kernel, or else looping in the host code that calls the kernel.
For a pictorial example of what I am trying to describe, please refer to slide 14 of this deck which introduces some of the new features of CUDA 5 including dynamic parallelism. The code architecture on the right is an algorithm realized with dynamic parallelism. The architecture on the left is the same function realized without dynamic parallelism.
I have checked your algorithm in Wikipedia and I'm not sure you need dynamic parallelism at all.
You do the expectation step in your kernel, __syncthreads(), do the maximization step, and __syncthreads() again. From this distance, the expectation looks like a reduction primitive, and the maximization is a filter one.
If it doesn't work, and you need real task parallelism, a GPU may not be the best choice. While the Kepler GPUs can do that to some degree, this is not what this architecture is designed for. In that case you might be better off using a multi-CPU system, such as an office grid, a supercomputer, or a Xeon Phi accelerator. You should also check OpenMP and MPI, these are the languages used for task-parallel programming (actually OpenMP is just a handful of pragmas in most cases).

How to allocate video-card memory directly in Windows?

I am required to allocate bitmap in video-card memory in a project in Windows. Because the project use other 2d library other than GDI, so CreateCompatibleBitmap is useless.
Then i figure out a method using DX, here is my code:
if(FAILED(g_D3DDevice->CreateVertexBuffer(10240 * 1024, 0,
D3DFVF_VERTEX, D3DPOOL_DEFAULT, &g_VertexBuffer,
NULL))) return false;
// Fill the vertex buffer.
void *ptr;
if(FAILED(g_VertexBuffer->Lock(0, 1024 * 10240,
(void**)&ptr, 0))) return false;
//do something...
//printf("ptf = %x\n", ptr);
//memcpy(ptr, objData, sizeof(objData));
g_VertexBuffer->Unlock();
It works well so far. But are there any sideeffect?
Sideeffect:
You will have to unlock the VertexBuffer every single time you want to write or read to it
Slower then normal memory
Aside from this sideeffect make sure to check this process went sucessfull as video card out of memory does not supply the most "user obvious"-kind of error.
As for why it would be slow: by putting it into your video card memory you will have to request access to the vertex each time you want to write/read, which is slower then actually accessing it right away in regular memory.
I can't imagine any legitimate reason to want to do this but in general it is not possible anyway. Windows PCs have a wide range of hardware and many do not even have a concept of separate video memory. For those that do the performance characteristics are not guaranteed which is why Direct3D has lots of ways of indicating to the driver your intended usage so that the driver can decide where best to locate resources (amongst other things, like how best to lay them out in memory) rather than providing ways for you to specify 'in video memory' and other concepts that are not consistently defined across different hardware types.
If as you say your boss asked you to do this it seems like a case study in bad project management - specifying the implementation details of what to do rather than specifying the project goals and leaving the engineer tasked with implementation to figure out how best to achieve them. Presumably the real goal here is to improve performance of some task or operation. Almost certainly what you're doing is not the best way to achieve that.

Efficiency of branching in shaders

I understand that this question may seem somewhat ungrounded, but if someone knows anything theoretical / has practical experience on this topic, it would be great if you share it.
I am attempting to optimize one of my old shaders, which uses a lot of texture lookups.
I've got diffuse, normal, specular maps for each of three possible mapping planes and for some faces which are near to the user I also have to apply mapping techniques, which also bring a lot of texture lookups (like parallax occlusion mapping).
Profiling showed that texture lookups are the bottleneck of the shader and I am willing to remove some of them away. For some cases of the input parameters I already know that part of the texture lookups would be unnecessary and the obvious solution is to do something like (pseudocode):
if (part_actually_needed) {
perform lookups;
perform other steps specific for THIS PART;
}
// All other parts.
Now - here comes the question.
I do not remember exactly (that's why I stated the question might be ungrounded), but in some paper I recently read (unfortunately, can't remember the name) something similar to the following was stated:
The performance of the presented
technique depends on how efficient
the HARDWARE-BASED CONDITIONAL
BRANCHING is implemented.
I remembered this kind of statement right before I was about to start refactoring a big number of shaders and implement that if-based optimization I was talking about.
So - right before I start doing that - does someone know something about the efficiency of the branching in shaders? Why could branching give a severe performance penalty in shaders?
And is it even possible that I could only worsen the actual performance with the if-based branching?
You might say - try and see. Yes, that's what I'm going to do if nobody here is helps me :)
But still, what in the if case may be effective for new GPU's could be a nightmare for a bit older ones. And that kind of issue is very hard to forecast unless you have a lot of different GPU's (that's not my case)
So, if anyone knows something about that or has benchmarking experience for these kinds of shaders, I would really appreciate your help.
Few remaining brain cells that are actually working keep telling me that branching on the GPU's might be far not as effective as branching for the CPU (which usually has extremely efficient ways of branch predictions and eliminating cache misses) simply because it's a GPU (or that could be hard / impossible to implement on the GPU).
Unfortunately I am not sure if this statement has anything in common with the real situation...
If the condition is uniform (i.e. constant for the entire pass), then the branch is essentially free because the framework will essentially compile two versions of the shader (branch taken and not) and choose one of these for the entire pass based on your input variable. In this case, definitely go for the if statement as it will make your shader faster.
If the condition varies per vertex/pixel, then it can indeed degrade performance and older shader models don't even support dynamic branching.
Unfortunately, I think the real answer here is to do practical testing with a performance analyser of your specific case, on your target hardware. Particularly given that it sounds like you're at project optimisation stage; this is the only way to take into account the fact that hardware changes frequently and the nature of the specific shader.
On a CPU, if you get a mispredicted branch, you'll cause a pipeline flush and since CPU pipelines are so deep, you'll effectively lose something in the order of 20 or more cycles. On the GPU things a little different; the pipeline are likely to be far shallower, but there's no branch prediction and all of the shader code will be in fast memory -- but that's not the real difference.
It's difficult to know the exact details of everything that's going on, because nVidia and ATI are relatively tight-lipped, but the key thing is that GPUs are made for massively parallel execution. There are many asynchronous shader cores, but each core is again designed to run multiple threads. My understanding is that each core expects to run the same instruction on all it's threads on any given cycle (nVidia calls this collection of threads a "warp").
In this case, a thread might represent a vertex, a geometry element or a pixel/fragment and a warp is a collection of about 32 of those. For pixels, they're likely to be pixels that are close to each other on screen. The problem is, if within one warp, different threads make different decisions at the conditional jump, the warp has diverged and is no longer running the same instruction for every thread. The hardware can handle this, but it's not entirely clear (to me, at least) how it does so. It's also likely to be handled slightly differently for each successive generation of cards. The newest, most general CUDA/compute-shader friendly nVidias might have the best implementation; older cards might have a poorer implementation. The worse case is you may find many threads executing both sides of if/else statements.
One of the great tricks with shaders is learning how to leverage this massively parallel paradigm. Sometimes that means using extra passes, temporary offscreen buffers and stencil buffers to push logic up out of the shaders and onto the CPU. Sometimes an optimisation may appear to burn more cycles, but it could actually be reducing some hidden overhead.
Also note that you can explicitly mark if statements in DirectX shaders as [branch] or [flatten]. The flatten style gives you the right result, but always executes all in the instructions. If you don't explicitly choose one, the compiler can choose one for you -- and may pick [flatten], which is no good for your example.
One thing to remember is that if you jump over the first texture lookup, this will confuse the hardware's texture coordinate derivative math. You'll get compiler errors and it's best not to do so, otherwise you might miss out on some of the better texturing support.
In many cases the both branches could be calculated and mixed by condition as interpolator.
That approach works much faster than branch. Could be used on CPU also.
For instance:
...
vec3 c = vec3(1.0, 0.0, 0.0);
if (a == b)
c = vec3(0.0, 1.0, 0.0);
could be replaced by:
vec3 c = mix(vec3(1.0, 0.0, 0.0), vec3(0.0, 1.0, 0.0), (a == b));
...
Here's a real world performance benchmark on a kindle Fire:
In the fragment shader...
This runs at 20fps:
lowp vec4 a = vec4(0.0, 0.0, 0.0, 0.0);
if (a.r == 0.0)
gl_FragColor = texture2D ( texture1, TextureCoordOut );
This runs at 60fps:
gl_FragColor = texture2D ( texture1, TextureCoordOut );
I don't know about the if-based optimizations, but how about just creating all the permutations of the texture-lookups that you think you'll need, each its own shader, and just use the right shader for the right situation (depending on which texture lookups a particular model, or part of your model, needed). I think we did something like this on Bully for Xbox 360.

Resources