observation about false sharing - caching

I have an observation about false sharing; please look at this code:
struct foo {
    int x;
    int y;
};

static struct foo f;

/* The two following functions are running concurrently: */

int sum_a(void)
{
    int s = 0;
    int i;
    for (i = 0; i < 1000000; ++i)
        s += f.x;
    return s;
}

void inc_b(void)
{
    int i;
    for (i = 0; i < 1000000; ++i)
        ++f.y;
}
Here, sum_a may need to continually re-read x from main memory
(instead of from cache) even though inc_b's concurrent modification of
y should be irrelevant.
On Oracle's website I read this explanation:
simultaneous updates of individual elements in the same cache line
coming from different processors invalidates entire cache lines, even
though these updates are logically independent of each other. Each
update of an individual element of a cache line marks the line as
invalid. Other processors accessing a different element in the same
line see the line marked as invalid. They are forced to fetch a more
recent copy of the line from memory or elsewhere, even though the
element accessed has not been modified. This is because cache
coherency is maintained on a cache-line basis, and not for individual
elements. As a result there will be an increase in interconnect traffic and overhead. Also, while the cache-line update is in progress, access to the elements in the line is inhibited.
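For illustration only (this is not from the question): the usual mitigation is to keep the two fields on separate cache lines, for example by over-aligning them. A minimal C++11/C11 sketch, assuming a 64-byte cache line (common, but not guaranteed):

struct foo_padded {
    alignas(64) int x;   /* only read by sum_a   */
    alignas(64) int y;   /* only written by inc_b */
};
static struct foo_padded f_padded;

With this layout, inc_b's writes to y no longer invalidate the cache line that holds x.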
My observation is simple:
If what I read is true, how is it possible that the "dirty read" problem exists?

Related

Minimize the number of page faults by loop interchange

Assume page size is 1024 words and each row is stored in one page.
If the OS allocates 512 frames for the program and uses the LRU page replacement algorithm,
what will be the number of page faults in the following programs?
int A[][] = new int[1024][1024];

Program 1:
for (j = 0; j < A.length; j++)
    for (i = 0; i < A.length; i++)
        A[i][j] = 0;

Program 2:
for (i = 0; i < A.length; i++)
    for (j = 0; j < A.length; j++)
        A[i][j] = 0;
I assume that bringing in pages by row is better than bringing them in by column, but I cannot support my claim. Can you help me calculate the number of page faults?
One way to answer this is by simulation. You could change your loops to output the address of the assignment rather than setting it to zero:
printf("%p\n", &A[i][j]);
Then, write a second program that simulates page placement, so it would do something like:
/* Needs <stdio.h> and <stdint.h>. NWORKING_SET (the number of frames, 512 in
   the question), pagesize and gethex() are assumed to be defined elsewhere. */
uintptr_t h;
uintptr_t work[NWORKING_SET] = {0};   /* resident pages, initially empty */
int lru = 0;                          /* next slot to replace (round-robin) */
int fault = 0;

while (gethex(&h)) {                  /* read one address from the trace */
    h /= pagesize;                    /* address -> page number */
    int i;
    for (i = 0; i < NWORKING_SET && work[i] != h; i++) {
    }
    if (i == NWORKING_SET) {          /* page not resident: fault and replace */
        work[lru] = h;
        fault++;
        lru = (lru + 1) % NWORKING_SET;
    }
}
printf("%d\n", fault);
With that program in place, you can try multiple traversal strategies. PS: my lru just happens to work; I'm sure you can do much better.
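gethex() is not shown above; a minimal version (hypothetical, assuming the trace is one %p-style hex address per line on stdin) could be:

#include <inttypes.h>
#include <stdio.h>

/* Hypothetical helper for the simulator above: reads one hex address
   (as printed by printf("%p\n", ...)) from stdin.
   Returns 1 on success, 0 at end of input. */
static int gethex(uintptr_t *h)
{
    return scanf("%" SCNxPTR, h) == 1;
}

Pipe the address trace produced by the modified Program 1 or Program 2 into the simulator to compare the fault counts.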
For the second program: the CPU accesses the first int in a row, causing a page fault, then accesses the other ints in the row while the page is already present. This means that (if the rows start on page boundaries) you'll get one page fault per row, plus probably one when the program's code is first started, plus probably another when the program's stack is first used (and one more if the array isn't aligned on a page boundary); which probably works out to 1026 or 1027 page faults.
For the first program: the CPU accesses the first int in a row, causing a page fault; but by the time it accesses the second int in that same row, the page has been evicted (it became "least recently used" and was replaced with a different page). This means that you'll get 1024*1024 page faults while accessing the array (plus one for the program's code, stack, etc.). That probably works out to 1048578 page faults (as long as the start of the array is aligned to sizeof(int)).
However, this all assumes that the compiler failed to optimize anything. In reality it's extremely likely that any compiler worth using would have transformed both programs into something a little more like "memset(array, 0, sizeof(int)*1024*1024);", which does consecutive writes (possibly writing multiple ints in a single larger write if the underlying CPU supports larger writes). This implies that both programs would probably cause 1026 or 1027 page faults.

CUDA kernel with single branch runs 1.5x faster than kernel without branch

I've got a strange performance inversion on a filter kernel with and without branching. The kernel with branching runs ~1.5x faster than the kernel without branching.
Basically I need to sort a bunch of radiance rays and then apply interaction kernels. Since there is a lot of accompanying data, I can't use something like thrust::sort_by_key() many times.
Idea of the algorithm:
Run a loop over all possible interaction types (there are five)
At every iteration a warp thread votes for its interaction type
After loop completion every warp thread knows about the other threads with the same interaction type
Threads elect their leader (per interaction type)
The leader updates the interaction offsets table using atomicAdd
Each thread writes its data to the corresponding offset
I used techniques described in this Nvidia post https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-optimized-filtering-warp-aggregated-atomics/
My first kernel contains a branch inside the loop and runs for ~5 ms:
int active;
int leader;
int warp_progress;
for (int i = 0; i != hit_interaction_count; ++i)
{
    if (i == decision)
    {
        active = __ballot(1);
        leader = __ffs(active) - 1;
        warp_progress = __popc(active);
    }
}
My second kernel uses a lookup table of two elements, has no branching, and runs for ~8 ms:
int active = 0;
for (int i = 0; i != hit_interaction_count; ++i)
{
    const int masks[2] = { 0, ~0 };
    int mask = masks[i == decision];
    active |= (mask & __ballot(mask));
}
int leader = __ffs(active) - 1;
int warp_progress = __popc(active);
Common part:
int warp_offset;
if (lane_id() == leader)
    warp_offset = atomicAdd(&interactions_offsets[decision], warp_progress);
warp_offset = warp_broadcast(warp_offset, leader);
...copy data here...
How can that be? Is there any way to implement such a filter kernel so that it runs faster than the branching one?
UPD: Complete source code can be found in filter_kernel cuda_equation/radiance_cuda.cu at https://bitbucket.org/radiosity/engine/src
I think this is CPU-programmer brain deformation. On a CPU I expected a performance boost because of the eliminated branch and branch-misprediction penalty.
But there is no branch prediction on a GPU and no such penalty, so only the instruction count matters.
First, let me rewrite the code in a simpler form.
With branch:
int active;
for (int i = 0; i != hit_interaction_count; ++i)
    if (i == decision)
        active = __ballot(1);
Without branch:
int active = 0;
for (int i = 0; i != hit_interaction_count; ++i)
{
    int mask = 0 - (i == decision);
    active |= (mask & __ballot(mask));
}
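As a sanity check of the mask trick (this is not part of the original kernels), here is a tiny host-side C++ sketch showing that 0 - (i == decision) produces either all-zero or all-one bits:

#include <cstdio>

// (i == decision) evaluates to 0 or 1, so 0 - (i == decision) is either
// 0x00000000 or 0xFFFFFFFF.
int main()
{
    int decision = 3;
    for (int i = 0; i != 5; ++i) {
        int mask = 0 - (i == decision);
        std::printf("i=%d mask=0x%08x\n", i, static_cast<unsigned>(mask));
    }
    return 0;
}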
In the first version there are ~3 operations: compare, branch, and __ballot().
In the second version there are ~5 operations: compare, make mask, __ballot(), & and |=.
And there are ~15 ops in the common code.
Both loops run for 5 iterations. In total that makes ~35 ops in the first version and ~45 in the second, which can explain the performance degradation.

Are std::get<> and std::tuple<> slower than raw pointers?

I have a C++11 application where I commonly iterate over several different structures of arrays for various algorithms. Raw CPU performance is important for this app.
The array elements are fundamental types (int, double, ...) or simple structs. The arrays are typically tens of thousands of elements long. I often need to iterate over several arrays at once in a given loop, so typically I need one pointer for each array of whatever type. Sometimes I need to increment five individual pointers, which is verbose.
Based on these answers about tuples,
Why is std::pair faster than std::tuple
C++11 tuple performance
I hoped there was no overhead to using tuples to pack the pointers together into a single object.
I thought it might be nice to implement a cursor-like object to assist in iterating, since missing the increment on a particular pointer would be an annoying bug.
auto pts = std::make_tuple(p1, p2, p3...);
allows you to bundle a bunch of variables together in a typesafe way. Then you can implement a variadic template function to increment each pointer in the tuple in a type-safe way, as sketched below.
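For illustration, a minimal C++11 sketch of such an increment helper (the name advance_all and its structure are mine, not taken from the question's code):

#include <cstddef>
#include <tuple>
#include <type_traits>

// Hypothetical helper: increment every pointer held in a tuple (C++11, recursive).
template <std::size_t I = 0, typename... Ts>
typename std::enable_if<I == sizeof...(Ts), void>::type
advance_all(std::tuple<Ts...>&) {}           // recursion end: nothing left to bump

template <std::size_t I = 0, typename... Ts>
typename std::enable_if<(I < sizeof...(Ts)), void>::type
advance_all(std::tuple<Ts...>& t)
{
    ++std::get<I>(t);                        // bump the I-th pointer
    advance_all<I + 1, Ts...>(t);            // then the rest
}

With that in place, a loop body could call advance_all(pts) instead of incrementing each pointer by hand.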
However...
When I measure performance, the tuple version was slower than using raw pointers. When I look at the generated assembly, I see additional mov instructions in the tuple loop increment. Maybe that is because std::get<> returns a reference? I had hoped that would be compiled away...
Am I missing something, or are raw pointers just going to beat tuples when used like this? Here is a simple test harness; I threw away the fancy cursor code and just use a std::tuple<> for this test.
On my machine, the tuple loop is consistently twice as slow as the raw pointer version for various data sizes.
My system config is Visual C++ 2013 x64 on Windows 8 with a release build. I did try turning on various optimizations in Visual Studio, such as
Inline Function Expansion: Any Suitable (/Ob2)
but it did not seem to change the timing results for my case.
I did need to do two extra things to avoid aggressive optimization by VS:
1) I forced the test data array to be allocated on the heap, not the stack. That made a big difference when I timed things, possibly due to memory cache effects.
2) I forced a side effect by writing to a static variable at the end so the compiler would not just skip my loop.
struct forceHeap
{
    __declspec(noinline) int* newData(int M)
    {
        int* data = new int[M];
        return data;
    }
};

void timeSumCursor()
{
    static int gIntStore;
    int maxCount = 20;
    int M = 10000000;
    // compiler might place array on stack which changes the timing
    // int* data = new int[N];
    forceHeap fh;
    int* data = fh.newData(M);
    int *front = data;
    int *end = data + M;
    int j = 0;
    for (int* p = front; p < end; ++p)
    {
        *p = (++j) % 1000;
    }

    {
        BEGIN_TIMING_BLOCK("raw pointer loop", maxCount);
        int* p = front;
        int sum = 0;
        int* cursor = front;
        while (++cursor != end)
        {
            sum += *cursor;
        }
        gIntStore = sum; // force a side effect
        END_TIMING_BLOCK();
    }
    printf("%d\n", gIntStore);

    {
        // just use a simple tuple to show the issue
        // rather than a full blown cursor object
        BEGIN_TIMING_BLOCK("tuple loop", maxCount);
        int sum = 0;
        auto cursor = std::make_tuple(front);
        while (++std::get<0>(cursor) != end)
        {
            sum += *std::get<0>(cursor);
        }
        gIntStore = sum; // force a side effect
        END_TIMING_BLOCK();
    }
    printf("%d\n", gIntStore);
    delete[] data;
}

Understanding Cache memory

I am trying to understand how cache memory reads and writes, and I am trying to determine the hit and miss rates. I have tried reading the textbook "Computer Systems: A Programmer's Perspective" over and over and can't seem to grasp this idea. Maybe someone can help me understand it:
I am working with a two-dimensional array which has 480 rows and 640 columns. The cache is direct-mapped and 64 KB with 4 byte lines. Below is the C-code:
struct pixel {
    char r;
    char g;
    char b;
    char a;
};

struct pixel buffer[480][640];
register int i, j;
register char *cptr;
register int *iptr;
sizeof(char) == 1, meaning each element of the array is 4 bytes (if I am understanding that correctly). The buffer begins at memory address 0 and the cache is initially empty (cold cache). The only memory accesses are to the entries of the array; all other variables are stored in registers.
for (j = 0; j < 640; j++) {
    for (i = 0; i < 480; i++) {
        buffer[i][j].r = 0;
        buffer[i][j].g = 0;
        buffer[i][j].b = 0;
        buffer[i][j].a = 0;
    }
}
The code above initializes all the elements of the array to 0, so it must be writing. I can see that this has bad locality because the array is written column by column instead of row by row. Doesn't that affect the miss rate? I am trying to determine the miss rate for this code based on the cache size. I think the miss rate is 100%, and that if the access pattern were row by row it would be 25%. But I am not totally understanding how cache memory works, so... can anyone tell me something that could help me understand this better?
I would recommend you watch the whole tutorial series if you are a beginner.
But for your question, lectures 27 to 31 explain everything.
https://www.youtube.com/watch?v=tGarzP488Wc&index=29&list=PL2F82ECDF8BB71B0C
IISc Bangalore.
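As a complement to the lectures, one way to check your miss-rate intuition is to simulate the exact cache from the question (direct-mapped, 64 KB, 4-byte lines, buffer starting at address 0). This is only a sketch and assumes a write-allocate cache:

#include <cstddef>
#include <cstdio>
#include <vector>

int main()
{
    const std::size_t cache_size = 64 * 1024;
    const std::size_t line_size  = 4;
    const std::size_t num_lines  = cache_size / line_size;
    std::vector<long long> tags(num_lines, -1);           // -1 == empty cache line

    const std::size_t rows = 480, cols = 640, pixel = 4;  // sizeof(struct pixel)
    std::size_t accesses = 0, misses = 0;

    for (std::size_t j = 0; j < cols; ++j)                // column-major, as in the question
        for (std::size_t i = 0; i < rows; ++i)
            for (std::size_t b = 0; b < pixel; ++b) {     // the .r .g .b .a writes
                std::size_t addr = (i * cols + j) * pixel + b;  // buffer starts at 0
                std::size_t line = addr / line_size;      // memory line number
                std::size_t set  = line % num_lines;      // direct-mapped slot
                ++accesses;
                if (tags[set] != static_cast<long long>(line)) {
                    ++misses;                             // miss: load the line
                    tags[set] = static_cast<long long>(line);
                }
            }

    std::printf("accesses=%zu misses=%zu miss rate=%.1f%%\n",
                accesses, misses, 100.0 * misses / accesses);
    return 0;
}

Swapping the i and j loops gives the row-major version, so you can compare the two miss rates directly.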

Fastest way to simulate recurrent equations in OpenCL

I am trying to simulate a recurrent equation of the type

s_i(t+1) = f( Σ_j W_ij * s_j(t) + v_i * input(t) )

in OpenCL, where f(.) is some non-linear function (in the code below it is just a step function with threshold th) and input(t) is some external input. Naturally, I implemented one worker for every s_i. In every time step every worker calculates the result of the equation above, and subsequently this result is shared with all other workers. Therefore, all workers have to be in the same workgroup.
My current OpenCL kernel looks like this
__kernel void part1(__global int* s, __global float* W, __global float* Th,
                    __global float* V, __global float* input, int N, int T)
{
    unsigned int i = get_global_id(0);
    float value = 0;
    float v = V[i];
    float th = Th[i];
    for (int t = 0; t < T; t++) {
        value = v*input[t];
        for (int j = 0; j < N; j++) {
            value = value + W[i*N + j]*s[j];
        }
        barrier(CLK_GLOBAL_MEM_FENCE);
        if (value >= th) {
            s[i] = 1;
        } else {
            s[i] = 0;
        }
        barrier(CLK_GLOBAL_MEM_FENCE);
    }
}
Unfortunately, this code is actually three times slower than an equivalent C-implementation. Also, I expected that a change in the number of workers should not make a huge difference (because new workers are sitting on new threads that run in parallel to the others), but actually the processing time increases linearly with the number of workers. The bottleneck seems to be the writing operation after the first barrier. Eliminating this operation (but leaving the barrier in place) cuts down the processing time by a factor of 25 and eliminates the linear dependence.
I am pretty new to OpenCL and I would appreciate any help to speed this code up!
Thanks a lot in advance!
Blue2script
As I already stated in my comment, accessing global memory is slow. Usually the hardware hides the latency by having several subgroups of threads running on the same compute unit. The subgroups I'm referring to are called warps in NVIDIA lingo and wavefronts in AMD's. Usually a workgroup is composed of several subgroups.
So while one subgroup waits to receive data from global memory, another one that already has all the necessary resources can run. When the running one stalls because it needs to read/write data from/to global memory, another one can start running, and so on.
However, in your case, because of the barriers, all the workers in all the subgroups have to wait for what the others wrote to memory before being able to continue the computation (the barrier is at the workgroup level). Hence the latency hits you right in the face :).
Now, a way to improve your implementation would be to use local memory and this time a barrier at the local memory level (with the flag CLK_LOCAL_MEM_FENCE). The same principle I've just explained still applies but accessing local memory is much faster.
As far as I understand your code (quite likely I didn't get all the subtleties), your s array has N elements, and I guess that you also have N workers. So you create a local array of N elements and do something like this:
kernel void part1(global int* s, global float* W, global float* Th,
                  global float* V, global float* input, local int* local_s,
                  int N, int T)
{
    unsigned int i = get_global_id(0);
    unsigned int local_i = get_local_id(0);
    float value = 0;
    float v = V[i];
    float th = Th[i];

    // fetch from global to local and sync before computing
    local_s[local_i] = s[i];
    barrier(CLK_LOCAL_MEM_FENCE);

    for (int t = 0; t < T; t++) {
        value = v*input[t];
        for (int j = 0; j < N; j++) {
            value = value + W[i*N + j]*local_s[j];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
        // note: index with local_i, since local_s is per-workgroup storage
        if (value >= th) {
            local_s[local_i] = 1;
        } else {
            local_s[local_i] = 0;
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    // If necessary write some stuff back to global (maybe the last s computed?)
}
Now I have to warn you:
I might have completely misunderstood your needs :)
I've just edited your code while typing this answer, so there are most probably typos and such.
Even using local memory, having so many barriers might still make the OpenCL version slower than the C one.
Note that I removed the leading __ since they are not necessary and it is easier to read in my opinion.
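One practical detail that the answer above leaves out: the local_s buffer is sized from the host by passing clSetKernelArg a size and a NULL pointer. A sketch, assuming the modified kernel signature above and that kernel, N and T come from the usual OpenCL host setup (the helper name is illustrative):

#include <CL/cl.h>

/* Host-side sketch: set the tail arguments of the modified kernel above.
   Indices 0..4 (s, W, Th, V, input) are assumed to be set elsewhere
   with cl_mem buffers. */
static void set_part1_args(cl_kernel kernel, cl_int N, cl_int T)
{
    clSetKernelArg(kernel, 5, N * sizeof(cl_int), NULL); /* local int* local_s */
    clSetKernelArg(kernel, 6, sizeof(cl_int), &N);       /* int N */
    clSetKernelArg(kernel, 7, sizeof(cl_int), &T);       /* int T */
}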
EDIT: Regarding your comment about CLK_LOCAL_MEM_FENCE vs. CLK_GLOBAL_MEM_FENCE. A barrier is always applied at the workgroup level, so all workers within a workgroup have to hit that barrier. The flag given as parameter refers to memory access. When the flag is CLK_GLOBAL_MEM_FENCE it means that every read/write operation regarding the global memory has to be completed by every worker before any worker can continue to run the next statements. This is exactly the same with the CLK_LOCAL_MEM_FENCE flag but for the local memory.
