Performance issues with LDS memory in OpenCL

I have a performance problem when using LDS memory with an AMD Radeon HD 6850.
I have two kernels as parts of an N-particle simulation. Each work item has to calculate the force acting on its corresponding particle, based on the relative positions of the other particles. The problematic kernel is:
#define UNROLL_FACTOR 8
//Verlet velocity part kernel
__kernel void kernel_velocity(const float deltaTime,
                              __global const float4 *pos,
                              __global float4 *vel,
                              __global float4 *accel,
                              __local float4 *pblock,
                              const float bound)
{
    const int gid = get_global_id(0);   //global id of work item
    const int id = get_local_id(0);     //local id of work item within work group
    const int s_wg = get_local_size(0); //work group size
    const int n_wg = get_num_groups(0); //number of work groups
    const float4 myPos = pos[gid];
    const float4 myVel = vel[gid];
    const float4 dt = (float4)(deltaTime, deltaTime, 0.0f, 0.0f);
    float4 acc = (float4)0.0f;
    for (int jw = 0; jw < n_wg; ++jw)
    {
        pblock[id] = pos[jw * s_wg + id]; //cache a particle position; position in array: workgroup no. * size of workgroup + local id
        barrier (CLK_LOCAL_MEM_FENCE);    //wait for others in the work group
        for (int i = 0; i < s_wg; )
        {
            #pragma unroll UNROLL_FACTOR
            for (int j = 0; j < UNROLL_FACTOR; ++j, ++i)
            {
                float4 r = myPos - pblock[i];
                float rSizeSquareInv = native_recip (r.x*r.x + r.y*r.y + 0.0001f);
                float rSizeSquareInvDouble = rSizeSquareInv * rSizeSquareInv;
                float rSizeSquareInvQuadr = rSizeSquareInvDouble * rSizeSquareInvDouble;
                float rSizeSquareInvHept = rSizeSquareInvQuadr * rSizeSquareInvDouble * rSizeSquareInv;
                acc += r * (2.0f * rSizeSquareInvHept - rSizeSquareInvQuadr);
            }
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    acc *= 24.0f / myPos.w;
    //update velocity only
    float4 newVel = myVel + 0.5f * dt * (accel[gid] + acc);
    //write to global memory
    vel[gid] = newVel;
    accel[gid] = acc;
}
The simulation runs fine in terms of results, but the performance suffers when I use local memory to cache the particle positions and relieve the large amount of reading from global memory. In fact, if the line
float4 r = myPos - pblock[i];
is replaced by
float4 r = myPos - pos[jw * s_wg + i];
the kernel runs faster. I don't really get that, since reading from global memory should be much slower than reading from local memory.
Moreover, when the line
float4 r = myPos - pblock[i];
is removed completely and every following occurrence of r is replaced by myPos - pblock[i], the speed is the same as if the line were not there at all. This puzzles me even more, since accessing the private variable r should be the fastest, yet the compiler somehow "optimizes" this line out.
The global work size is 4608 and the local work size is 192. It is run with AMD APP SDK v2.9 and Catalyst 13.12 drivers on Ubuntu 12.04.
Can anyone please help me with this? Is it my fault, or is it a problem with the GPU / drivers / ...? Or is it a feature? :-)

I'm gonna make a wild guess:
When using float4 r = myPos - pos[jw * s_wg + i]; the compiler is smart enough to notice that the barrier put after the initialization of pblock[id] is no longer necessary and removes it. Very likely all these barriers (in the for loop) hurt your performance, so removing them is very noticeable.
Yeah, but global access costs a lot too... So I'm guessing that behind the scenes the caches are being used well. There is also the fact that you use vectors, and the AMD Radeon HD 6850 has a VLIW architecture... maybe that also helps make better use of the caches... maybe.
EDIT:
I've just found an article benchmarking GPU/APU cache and memory latencies. Your GPU is in the list. You might get some more answers there (sorry, didn't really read it - too tired).

After some more digging it turned out that the code causes some LDS bank conflicts. The reason is that on AMD there are 32 banks, each 4 bytes wide, but a float4 covers 16 bytes, so the half-wavefront accesses different addresses within the same banks. The solution was to use separate __local float* buffers for the x and y coordinates and to read them separately, with the array index shifted as (id + i) % s_wg. Nevertheless, the overall gain in performance is small, most likely due to the overall latencies described in the link provided by @CaptainObvious (then one has to increase the global work size to hide them).
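For reference, here is a minimal sketch of the fix described above, not the exact production kernel: it assumes the host now passes two __local float buffers (pblockX, pblockY) of s_wg elements each, the kernel name is illustrative, and the unroll pragma is dropped for brevity. Since dt.z and dt.w are zero, only the x and y components of the acceleration ever contribute to the velocity update, so caching just x and y is enough.

//Velocity Verlet kernel with split x/y local buffers to avoid LDS bank conflicts (sketch)
__kernel void kernel_velocity_split(const float deltaTime,
                                    __global const float4 *pos,
                                    __global float4 *vel,
                                    __global float4 *accel,
                                    __local float *pblockX,   //s_wg elements
                                    __local float *pblockY,   //s_wg elements
                                    const float bound)
{
    const int gid = get_global_id(0);
    const int id = get_local_id(0);
    const int s_wg = get_local_size(0);
    const int n_wg = get_num_groups(0);

    const float4 myPos = pos[gid];
    const float4 myVel = vel[gid];
    const float4 dt = (float4)(deltaTime, deltaTime, 0.0f, 0.0f);
    float4 acc = (float4)0.0f;

    for (int jw = 0; jw < n_wg; ++jw)
    {
        //cache x and y separately, one float per work item per buffer
        const float4 p = pos[jw * s_wg + id];
        pblockX[id] = p.x;
        pblockY[id] = p.y;
        barrier(CLK_LOCAL_MEM_FENCE);

        for (int i = 0; i < s_wg; ++i)
        {
            //each work item starts at its own offset so the half-wavefront
            //spreads its reads over different 4-byte banks
            const int k = (id + i) % s_wg;
            const float rx = myPos.x - pblockX[k];
            const float ry = myPos.y - pblockY[k];

            float rSizeSquareInv = native_recip(rx*rx + ry*ry + 0.0001f);
            float rSizeSquareInvDouble = rSizeSquareInv * rSizeSquareInv;
            float rSizeSquareInvQuadr = rSizeSquareInvDouble * rSizeSquareInvDouble;
            float rSizeSquareInvHept = rSizeSquareInvQuadr * rSizeSquareInvDouble * rSizeSquareInv;
            float f = 2.0f * rSizeSquareInvHept - rSizeSquareInvQuadr;

            //z and w never contribute to the velocity update (dt.z = dt.w = 0)
            acc.x += rx * f;
            acc.y += ry * f;
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    acc *= 24.0f / myPos.w;
    float4 newVel = myVel + 0.5f * dt * (accel[gid] + acc);
    vel[gid] = newVel;
    accel[gid] = acc;
}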

Related

How to improve FPS and overcome memory bandwidth by random access on textures?

In my virtual reality program I am heavily bound by memory bandwidth:
#version 320 es
precision lowp float;
const int n_pool = 30;
layout(local_size_x = 8, local_size_y = 16, local_size_z = 1) in;
layout(rgba8, binding = 0) writeonly uniform lowp image2D image;
layout(rgba8, binding = 1) readonly uniform lowp image2DArray pool;
uniform mat3 RT[n_pool]; // <- this is a rotation-translation matrix
void main() {
    uint u = gl_GlobalInvocationID.y;
    uint v = gl_GlobalInvocationID.x;
    vec4 Ir = imageLoad(pool, ivec3(u,v,29));
    float cost = 1.0/0.0;
    for (int j = 0; j < 16; j++) {
        float C = 0.0;
        for (int i = 0; i < n_pool; i++) {
            vec3 w = RT[i]*vec3(u,v,j);
            C += length(imageLoad(pool, ivec3(w[0],w[1],i)) - Ir);
        }
        cost = C < cost ? C : cost;
    }
    imageStore(image, ivec2(u,v), vec4(cost, cost, cost, 1.0));
}
You can see that I have a lot of random accesses on a TEXTURE_2D_ARRAY (width = 320, height = 240, layers = 30). However, the accesses are not that random, because they stay in the proximity of u,v.
Here are my thoughts:
another texture format instead of RGBA floats (RGBA unsigned byte, maybe?).
the shared memory is too small to even store one grayscale image.
changing the loop order. Strangely, this ordering is faster, although the other one should have better caching behaviour.
resizing work groups to fit the textures better.
using compressed images (unlikely to give a performance boost, but in theory it should help with bandwidth).
What are your thoughts?
Do you have any actual data which shows that the issue you have is texture bandwidth, or is that just an assumption?
I can see a number of issues which suggest that may well not actually be your problem. For example:
vec3 w = RT[i]*vec3(u,v,j);
... you have a mat3 array load inside your inner loop, so on most architectures I know of you are probably uniform fetch bound, not texture bound. This should cache well in the GPU data cache, but it is probably still being refetched per loop iteration, which smells a lot more expensive than a single imageLoad() unless your texture format is exceptionally wide ...
If you are using fp16 or fp32 RGBA texture inputs, then narrower 8-bit unorm formats are always going to be faster (fp32 is particularly expensive).
For the following:
cost = C < cost ? C : cost;
... it's probably more reliable in terms of code generation to use the min() built-in function.
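To illustrate both points, here is a hedged sketch of the two loops with the per-i uniform work hoisted out of the j loop and the ternary replaced by min(). It relies on RT[i]*vec3(u,v,j) == RT[i]*vec3(u,v,0) + float(j)*RT[i][2]; the base/zcol names are illustrative, Ir, pool, RT and n_pool are as in the question's shader, and keeping 2 x n_pool vec3 temporaries live may itself cost registers, so treat it as something to profile rather than a guaranteed win.

    // hoist the matrix fetches: read RT[i] once per i instead of once per (i,j)
    vec3 base[n_pool];
    vec3 zcol[n_pool];
    for (int i = 0; i < n_pool; i++) {
        base[i] = RT[i] * vec3(float(u), float(v), 0.0);
        zcol[i] = RT[i][2];              // third column of the rotation-translation matrix
    }

    float cost = 1.0/0.0;
    for (int j = 0; j < 16; j++) {
        float C = 0.0;
        for (int i = 0; i < n_pool; i++) {
            vec3 w = base[i] + float(j) * zcol[i];
            C += length(imageLoad(pool, ivec3(w[0], w[1], i)) - Ir);
        }
        cost = min(C, cost);             // min() instead of the ternary
    }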
UPDATE 1
Moving from the conventional pixel pipeline to a compute shader brought a 3x speedup.
UPDATE 2
Using compressed formats increased FPS by 5%.
UPDATE 3
In this simplified version I didn't show that I actually created many temporary vectors on the fly (in both the outer and the inner loop). Removing the mat3/vec4/vec3 creation inside the loops brought a 2x speedup. Very surprising to me that creating vectors in loops is that expensive.
Now I am well into real-time and have fulfilled my goal...

Two float[] outputs in one kernel pass (Sobel -> Magnitude and Direction)

I wrote the following RenderScript code in order to calculate the magnitude and the direction within the same kernel as the Sobel gradients.
#pragma version(1)
#pragma rs java_package_name(com.example.xxx)
#pragma rs_fp_relaxed

rs_allocation bmpAllocIn, direction;
int32_t width;
int32_t height;

// Sobel, magnitude and direction
float __attribute__((kernel)) sobel_XY(uint32_t x, uint32_t y) {
    float sobX=0, sobY=0, magn=0;
    // leave a border of 1 pixel
    if (x>0 && y>0 && x<(width-1) && y<(height-1)){
        uchar4 c11 = rsGetElementAt_uchar4(bmpAllocIn, x-1, y-1);
        uchar4 c12 = rsGetElementAt_uchar4(bmpAllocIn, x-1, y);
        uchar4 c13 = rsGetElementAt_uchar4(bmpAllocIn, x-1, y+1);
        uchar4 c21 = rsGetElementAt_uchar4(bmpAllocIn, x, y-1);
        uchar4 c23 = rsGetElementAt_uchar4(bmpAllocIn, x, y+1);
        uchar4 c31 = rsGetElementAt_uchar4(bmpAllocIn, x+1, y-1);
        uchar4 c32 = rsGetElementAt_uchar4(bmpAllocIn, x+1, y);
        uchar4 c33 = rsGetElementAt_uchar4(bmpAllocIn, x+1, y+1);

        sobX = (float) c11.r-c31.r + 2*(c12.r-c32.r) + c13.r-c33.r;
        sobY = (float) c11.r-c13.r + 2*(c21.r-c23.r) + c31.r-c33.r;

        float d = atan2(sobY, sobX);
        rsSetElementAt_float(direction, d, x, y);
        magn = hypot(sobX, sobY);
    }
    else {
        magn = 0;
        rsSetElementAt_float(direction, 0, x, y);
    }
    return magn;
}
And the Java part:
float[] gm = new float[width*height]; // gradient magnitude
float[] gd = new float[width*height]; // gradient direction
ScriptC_sobel script;
script=new ScriptC_sobel(rs);
script.set_bmpAllocIn(Allocation.createFromBitmap(rs, bmpGray));
// dirAllocation: reference to the global variable "direction" in rs script. This
// dirAllocation is actually the second output of the kernel. It will be "filled" by
// the rsSetElementAt_float() method that include a reference to the current
// element (x,y) during the passage of the kernel.
Type.Builder TypeDir = new Type.Builder(rs, Element.F32(rs));
TypeDir.setX(width).setY(height);
Allocation dirAllocation = Allocation.createTyped(rs, TypeDir.create());
script.set_direction(dirAllocation);
// outAllocation: the kernel will slide along this global float Variable, which is
// "formally" the output (in principle the roles of the outAllocation (magnitude) and the
// second global variable direction (dirAllocation)could have been switched, the kernel
// just needs at least one in- or out-Allocation to "slide" along.)
Type.Builder TypeOut = new Type.Builder(rs, Element.F32(rs));
TypeOut.setX(width).setY(height);
Allocation outAllocation = Allocation.createTyped(rs, TypeOut.create());
script.forEach_sobel_XY(outAllocation); //start kernel
// here comes the problem
outAllocation.copyTo(gm);
dirAllocation.copyTo(gd);
In a nutshell: this code works on my older Galaxy Tab 2 (API 17), but it crashes (Fatal signal 7 (SIGBUS), code 2, fault addr 0x9e6d4000 in tid 6385) on my Galaxy S5 (API 21). The strange thing is that when I use a simpler kernel that just calculates the SobelX or SobelY gradients in the very same way (except for the second allocation, here used for the direction), it also works on the S5. Thus, the problem cannot be a simple compatibility issue. Also, as I said, the kernel itself runs without problems (I can log the magnitude and direction values), but it struggles with the above .copyTo statements. As you can see, the gm and gd float arrays have the same dimensions (width*height) as all other allocations used by the kernel. Any idea what the problem could be? Or is there an alternative, more robust way to do the whole thing?

metal compute function limitations

I have noticed that MTLBuffers used with computationally intensive shader functions tend to stop calculating before all threadgroups are done. When I use a MTLComputePipelineState and MTLComputeCommandEncoder to blur an image with very large blur radii, the resulting image is only halfway processed and one can actually see half-finished threadgroups. I did not narrow it down to the exact blur radius, but 16 pixels works fine, 32 is already too much and not even half the groups are computed.
So are there any limitations on how long a shader function call may take to finish, or anything like that? I have just read through most of the documentation on how to use the Metal framework and I cannot recall stumbling upon any such statement.
EDIT
Since in my case the problem was not a simple timeout but some internal error, I'm going to add some code.
The most expensive part is the block-matching algorithm that finds matching blocks in two images (i.e. consecutive frames of a movie):
//Exhaustive Search Block-matching algorithm
kernel void naiveMotion(
    texture2d<float,access::read> inputImage1 [[ texture(0) ]],
    texture2d<float,access::read> inputImage2 [[ texture(1) ]],
    texture2d<float,access::write> outputImage [[ texture(2) ]],
    uint2 gid [[ thread_position_in_grid ]]
)
{
    //area to search for matches
    float searchSize = 10.0;
    int searchRadius = searchSize/2;
    //window size to search in
    int kernelSize = 6;
    int kernelRadius = kernelSize/2;
    //this will store the motion direction
    float2 vector = float2(0.0,0.0);
    float2 maxVector = float2(searchSize,searchSize/2);
    float maxVectorLength = length(maxVector);
    //maximum error caused by noise
    float error = kernelSize*kernelSize*(10.0/255.0);
    for (int y = -searchRadius; y < searchRadius; ++y)
    {
        for (int x = 0; x < searchSize; ++x)
        {
            float diff = 0;
            for (int b = - kernelRadius; b < kernelRadius; ++b)
            {
                for (int a = - kernelRadius; a < kernelRadius; ++a)
                {
                    uint2 textureIndex(gid.x + x + a, gid.y + y + b);
                    float4 targetColor = inputImage2.read(textureIndex).rgba;
                    float4 referenceColor = inputImage1.read(gid).rgba;
                    float targetGray = 0.299*targetColor.r + 0.587*targetColor.g + 0.114*targetColor.b;
                    float referenceGray = 0.299*referenceColor.r + 0.587*referenceColor.g + 0.114*referenceColor.b;
                    diff = diff + abs(targetGray - referenceGray);
                }
            }
            if ( error > diff )
            {
                error = diff;
                //vertical motion is rather irrelevant but negative values can't be stored so just take the absolute value
                vector = float2(x, abs(y));
            }
        }
    }
    float intensity = length(vector)/maxVectorLength;
    outputImage.write(float4(normalize(vector), intensity, 1),gid);
}
I am using that shader on a 960x540 px image. With a searchSize of 9 and a kernelSize of 8, the shader runs over the whole image. Change the searchSize to 10 and the shader stops early with error code 1.

Optimizing 2D convolution filter with C++ AMP

I'm fairly new to GPU programming and C++ AMP. Can anyone help make a general optimized 2D image convolution filter? My fastest version so far is listed below. Can this be done better with tiling in some way?
This version works and is much faster than my CPU implementation but I hope to get it even better.
void FIRFilterCore(array_view<const float, 2> src, array_view<float, 2> dst, array_view<const float, 2> kernel)
{
    int vertRadius = kernel.extent[0] / 2;
    int horzRadius = kernel.extent[1] / 2;
    parallel_for_each(src.extent, [=](index<2> idx) restrict(amp)
    {
        float sum = 0;
        if (idx[0] < vertRadius || idx[1] < horzRadius ||
            idx[0] >= src.extent[0] - vertRadius || idx[1] >= src.extent[1] - horzRadius)
        {
            // Handle borders by duplicating edges
            for (int dy = -vertRadius; dy <= vertRadius; dy++)
            {
                index<2> srcIdx(direct3d::clamp(idx[0] + dy, 0, src.extent[0] - 1), 0);
                index<2> kIdx(vertRadius + dy, 0);
                for (int dx = -horzRadius; dx <= horzRadius; dx++)
                {
                    srcIdx[1] = direct3d::clamp(idx[1] + dx, 0, src.extent[1] - 1);
                    sum += src[srcIdx] * kernel[kIdx];
                    kIdx[1]++;
                }
            }
        }
        else // Central part
        {
            for (int dy = -vertRadius; dy <= vertRadius; dy++)
            {
                index<2> srcIdx(idx[0] + dy, idx[1] - horzRadius);
                index<2> kIdx(vertRadius + dy, 0);
                for (int dx = -horzRadius; dx <= horzRadius; dx++)
                {
                    sum += src[srcIdx] * kernel[kIdx];
                    srcIdx[1]++;
                    kIdx[1]++;
                }
            }
        }
        dst[idx] = sum;
    });
}
Another way to go about it would of course be to perform the convolution in the Fourier domain, but I'm not sure it would pay off as long as the filter is fairly small compared to the image (whose side lengths, by the way, are not powers of 2).
You can find a complete implementation of the Cartoonizer algorithm, which implements a couple of stencil-based algorithms, on Codeplex: http://ampbook.codeplex.com/
This includes several different implementations. The tradeoffs associated with them are discussed in the book that the samples were written for.
For the minimum frame processor settings (1 simplifier phase and a border width of 1), there is insufficient shared memory access to take advantage of tiled memory. This is clearly shown by comparing the times taken by the cartoonizing stage for the C++ AMP simple model (4.9 ms) and the tiled model (4.2 ms) running on a single GPU. You would expect the tiled implementation to execute more quickly, but it's comparable. For the default and maximum frame processor settings, tiled memory becomes more beneficial and the tiled model processors execute faster than the simple model ones.
There was a similar question here:
Several arithmetic operations pararellized in C++Amp
I posted some code there which shows a filter with a variable size.
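For a rough idea of what tiling buys here, the following sketch restructures the question's filter to stage a tile plus its halo in tile_static memory. It is not the Codeplex or linked-answer code: TILE and R are illustrative compile-time constants (a fixed maximum radius known at compile time), borders are still handled by clamping, the kernel is assumed to be (2R+1)x(2R+1), and the image extent is assumed to be a multiple of TILE (otherwise the tiled extent would need padding).

#include <amp.h>
using namespace concurrency;

static const int TILE = 16;
static const int R    = 3;   // assumed maximum filter radius

void FIRFilterTiled(array_view<const float, 2> src, array_view<float, 2> dst,
                    array_view<const float, 2> kernel)
{
    parallel_for_each(src.extent.tile<TILE, TILE>(),
        [=](tiled_index<TILE, TILE> tidx) restrict(amp)
    {
        tile_static float cache[TILE + 2 * R][TILE + 2 * R];

        const int ly = tidx.local[0];
        const int lx = tidx.local[1];

        // Each thread loads its own pixel plus a share of the halo, clamped to the edges.
        for (int y = ly; y < TILE + 2 * R; y += TILE)
        {
            for (int x = lx; x < TILE + 2 * R; x += TILE)
            {
                int sy = direct3d::clamp(tidx.tile_origin[0] + y - R, 0, src.extent[0] - 1);
                int sx = direct3d::clamp(tidx.tile_origin[1] + x - R, 0, src.extent[1] - 1);
                cache[y][x] = src(sy, sx);
            }
        }
        tidx.barrier.wait();

        // Convolve from tile_static memory instead of global memory.
        float sum = 0.0f;
        for (int dy = -R; dy <= R; dy++)
        {
            for (int dx = -R; dx <= R; dx++)
            {
                sum += cache[ly + R + dy][lx + R + dx] * kernel(R + dy, R + dx);
            }
        }
        dst[tidx.global] = sum;
    });
}

As the quoted passage above points out, whether this actually beats the simple version depends on how much reuse the filter size provides, so it is worth timing both.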

DirectX 11 Compute Shader - not writing all values

I am trying some experiments in fractal rendering with DirectX11 Compute Shaders.
The provided example runs on a FeatureLevel_10 device.
My RWStructured output buffer has a data format of R32G32B32A32_FLOAT.
The problem is that when writing to the buffer, it seems that only the alpha (w) value gets written, nothing else...
Here is the shader code:
struct BufType
{
    float4 value;
};

cbuffer ScreenConstants : register(b0)
{
    float2 ScreenDimensions;
    float2 Padding;
};

RWStructuredBuffer<BufType> BufferOut : register(u0);

[numthreads(1, 1, 1)]
void Main( uint3 DTid : SV_DispatchThreadID )
{
    uint index = DTid.y * ScreenDimensions.x + DTid.x;

    float minRe = -2.0f;
    float maxRe = 1.0f;
    float minIm = -1.2;
    float maxIm = minIm + ( maxRe - minRe ) * ScreenDimensions.y / ScreenDimensions.x;
    float reFactor = (maxRe - minRe ) / (ScreenDimensions.x - 1.0f);
    float imFactor = (maxIm - minIm ) / (ScreenDimensions.y - 1.0f);

    float cim = maxIm - DTid.y * imFactor;
    uint maxIterations = 30;
    float cre = minRe + DTid.x * reFactor;
    float zre = cre;
    float zim = cim;
    bool isInside = true;
    uint iterationsRun = 0;
    for( uint n = 0; n < maxIterations; ++n )
    {
        float zre2 = zre * zre;
        float zim2 = zim * zim;
        if ( zre2 + zim2 > 4.0f )
        {
            isInside = false;
            iterationsRun = n;
        }
        zim = 2 * zre * zim + cim;
        zre = zre2 - zim2 + cre;
    }
    if ( isInside )
    {
        BufferOut[index].value = float4(1.0f,0.0f,0.0f,1.0f);
    }
}
The code actually produces, in a sense, the correct result (the 2D Mandelbrot set), but it seems only the alpha value is touched and nothing else is written, although the pixels inside the set should be colored red... (the image is black & white).
Does anybody have a clue what's going on here?
After some fiddling around I found the problem.
I have not found any documentation from MS mentioning this, so it could also be an Nvidia-specific driver issue.
Apparently you are only allowed to write ONCE per compute shader invocation to the same element in a RWStructuredBuffer. And you also HAVE to write ONCE.
I changed the code to accumulate the correct color in a local variable and now write it only once at the end of the shader.
Everything works perfectly now in that way.
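For illustration, the restructure described above applied to the question's shader might look like this (a sketch, not the poster's exact code; the surrounding declarations stay as in the question, and the unused iteration counter is dropped). The color is accumulated locally and the buffer element is written exactly once, unconditionally, at the end.

[numthreads(1, 1, 1)]
void Main( uint3 DTid : SV_DispatchThreadID )
{
    uint index = DTid.y * ScreenDimensions.x + DTid.x;

    float minRe = -2.0f;
    float maxRe = 1.0f;
    float minIm = -1.2;
    float maxIm = minIm + ( maxRe - minRe ) * ScreenDimensions.y / ScreenDimensions.x;
    float reFactor = (maxRe - minRe) / (ScreenDimensions.x - 1.0f);
    float imFactor = (maxIm - minIm) / (ScreenDimensions.y - 1.0f);

    float cim = maxIm - DTid.y * imFactor;
    float cre = minRe + DTid.x * reFactor;
    float zre = cre;
    float zim = cim;
    bool isInside = true;

    uint maxIterations = 30;
    for( uint n = 0; n < maxIterations; ++n )
    {
        float zre2 = zre * zre;
        float zim2 = zim * zim;
        if ( zre2 + zim2 > 4.0f )
        {
            isInside = false;
        }
        zim = 2 * zre * zim + cim;
        zre = zre2 - zim2 + cre;
    }

    // accumulate the result in a local ...
    float4 color = isInside ? float4(1.0f, 0.0f, 0.0f, 1.0f)   // inside: red
                            : float4(0.0f, 0.0f, 0.0f, 1.0f);  // outside: black

    // ... and write the element exactly once, unconditionally
    BufferOut[index].value = color;
}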
I'm not sure, but shouldn't the BufferOut declaration be:
RWStructuredBuffer<BufType> BufferOut : register(u0);
instead of:
RWStructuredBuffer BufferOut : register(u0);
If you are only using a float4 write target, why not use just:
RWBuffer<float4> BufferOut : register (u0);
Maybe this could help.
After playing around again today, I ran into the same problem once more.
The following code produced all white output:
[numthreads(1, 1, 1)]
void Main( uint3 dispatchId : SV_DispatchThreadID )
{
    float4 color = float4(1.0f,0.0f,0.0f,1.0f);
    WriteResult(dispatchId,color);
}
The WriteResult method is a utility method from my HLSL standard library.
Long story short: after I upgraded from driver version 192 to 195 (beta), the problem went away.
Seems like the drivers still have some definite problems with compute shader support, so beware.
From what I've seen, compute shaders are only useful if you need a more general computational model than the traditional pixel shader, or if you can load data and then share it between threads in fast shared memory. I'm fairly sure you would get better performance with a pixel shader for the Mandelbrot shader.
On my setup (Win7, Feb 10 DX SDK, GTX 480), my compute shaders have a punishing setup time of over 0.2-0.3 ms (binding an SRV and a UAV and then calling dispatch()).
If you do a PS implementation, please post your experiences.
I have no direct experience with DX compute shaders but...
Why are you setting alpha = 1.0?
IIRC, that makes the pixel 100% transparent, so your inside pixels are transparent red, and show up as whatever color was drawn behind them.
When alpha = 1.0, the RGB components are never used.
