Open CL Kernel - every workitem overwrites global memory? - parallel-processing

I'm trying to write a kernel to get the character frequencies of a string.
First, here is the code I have for kernel right now:
_kernel void readParallel(__global char * indata, __global int * outdata)
{
int startId = get_global_id(0) * 8;
int maxId = startId + 7;
for (int i = startId; i < maxId; i++)
{
++outdata[indata[i]];
}
}
The variable inData holds the string in the global memory, and outdata is an array of 256 int values in the global memory. Every workitem reads 8 symbols from the string and should increase the appropriate ASCII-code in the array. The code compiles and executes, but outdata contains less occurrences overall than the number of characters in inData. I think the problem is that workitems overwrites the global memory. It would be nice if you can give me some tips to solve this.
By the way,. I am a rookie in OpenCL ;-) and, yes, I looked for solutions in other questions.

You are experiencing the effects of your uses of global memory not being atomic (C++-oriented description of what those are or another description by the Intel TBB folks). What happens, chronologically, is:
Some workgroup "thread" loads outData[123] into some register r1
... lots of work, reading and writing, happens, including on
outData[123]...
The same workgroup "thread" increments r1
... lots of work, reading and writing, happens, including on
outData[123]...
The same workgroup "thread" writes r1 to outData[123]
So, the value written to outData[123] "throws away" the updates during the time period between the read and the write (I'm ignoring the possibility of parallel writes corrupting each other rather than one of them winning out).
What you need to do is either:
Use atomic operations - the least amount of modifications to your code, but very inefficient, since it serializes your work to a great extent, or
Use work-item-specific, warp-specific and/or work-group-specific partial results, which require less/cheaper synchronization, and combine them eventually after having done a lot of work on them.
On an unrelated note, and as #huseyintugrulbuyukisik correctly points out, your code uses signed char values to index the array. To fix that, do one of the following:
reinterpret those char's as unsigned chars for array indices (and reinterpret back when reading the array.
upcast the char values to a larger integral type and add 128 to get an offset into the outArray.
Define your kernel to only support ASCII characters (no higher than 127), in which case you can ignore this issue (although that will be a potential crasher if you get invalid input.
If you only care about the frequency of printable characters (but can also have non-printing characters in the input), you could perform a run-time check before counting a character.

Related

Purpose of _Compiler_barrier() on 32bit read

I have been stepping through the function calls that are involved when I assign to an atomic_long type on VS2017 with a 64bit project. I specifically wanted to see what happens when I copy an atomic_long into a none-atomic variable, and if there is any locking around it.
atomic_long ll = 10;
long t2 = ll;
Ultimately it ends up with this call (I've removed some code that was ifdefed out)
inline _Uint4_t _Load_seq_cst_4(volatile _Uint4_t *_Tgt)
{ /* load from *_Tgt atomically with
sequentially consistent memory order */
_Uint4_t _Value;
_Value = *_Tgt;
_Compiler_barrier();
return (_Value);
}
Now, I've read from MSDN that a plain read of a 32bit value will be atomic:
Simple reads and writes to properly-aligned 32-bit variables are
atomic operations.
...which explains why there is no Interlocked function for just reading; only those for changing/comparing. What I'd like to know is what the _Compiler_barrier() bit is doing. This is #defined as
__MACHINE(void _ReadWriteBarrier(void))
...and I've found on MSDN again that this
Limits the compiler optimizations that can reorder memory accesses
across the point of the call.
But I don't get this, as there are no other memory accesses apart from the return call; surely the compiler wouldn't move the assignment below that would it?
Can someone please clarify the purpose of this barrier?
_Load_seq_cst_4 is an inline function. The compiler barrier is there to block reordering with later code in the calling function this inlines into.
For example, consider reading a SeqLock. (Over-simplified from this actual implementation).
#include <atomic>
atomic<unsigned> sequence;
atomic_long value;
long seqlock_try_read() {
// this would normally be the body of a retry-loop;
unsigned seq1 = sequence;
long tmpval = value;
unsigned seq2 = sequence;
if (seq1 == seq2 && (seq1 & 1 == 0)
return tmpval;
else
// writer was modifying it, we should retry the loop
}
If we didn't block compile-time reordering, the compiler could merge both reads of sequence into a single access, like perhaps like this
long tmpval = value;
unsigned seq1 = sequence;
unsigned seq2 = sequence;
This would defeat the locking mechanism (where the writer increments sequence once before modifying the data, then again when it's done). Readers are entirely lockless, but it's not a "lock-free" algo because if the writer gets stuck mid-update, the readers can't read anything.
The barrier within each load function blocks reordering with other things after inlining.
(The C++11 memory model is very weak, but the x86 memory model is strong, only allowing StoreLoad reordering. Blocking compile-time reordering with later loads/stores is sufficient to give you an acquire / sequential-consistency load at runtime. x86: Are memory barriers needed here?)
BTW, a better example might be something where some non-atomic variables are read/written after seeing a certain value in an atomic flag. MSVC probably already avoids reordering or merging of atomic accesses, and in the seqlock the data being protected also has to be atomic.
Why don't compilers merge redundant std::atomic writes?

Checking for valid user memory in kernel mode with copy_to_user

So, I tried using this:
copy_to_user(p, q, 0)
I want to copy from q to p and if it doesn't work, then I want to know if p points to an invalid address.
copy_to_user returns the number of bytes that weren't copied successfully but in this case, there are 0 bytes and I can't know for sure if p points to an invalid address.
Is there another way to check if p points to a valid user memory?
Yes. You need to check passing size value manually each time before calling copy_to_user(). If it's 0 or not in valid range -- you shouldn't call copy_to_user() at all. This way you can rely on copy_to_user() return value.
the method copy_to_user defined at /usr/src/linux-3.0.6-gentoo/include/asm-generic/uaccess.h
static inline long copy_to_user(void __user *to,
const void *from, unsigned long n)
{
might_fault();
if (access_ok(VERIFY_WRITE, to, n))
return __copy_to_user(to, from, n);
else
return n;
}
the method access_ok checks the accessibility of to(user memory). So you can use the method access_ok to check memory is valid or not(to is not NULL / it's in user space)?
Argument VERIFY_READ or VERIFY_WRITE. VERIFY_READ: identifies whether memory region is readable, VERIFY_WRITE: identifies whether the memory region is readable as well as writable.
source of method access_ok
And what do you consider 'valid user memory'? What do you need this for?
Let's say we only care about the target buffer residing in userspace range (for archs with joint address spaces). From this alone we see that testing the address without the size is pointless - what if the address is the last byte of userspace? Appropriate /range/ check is done by access_ok.
Second part is whether there is a page there or a read/write can be performed without servicing a page fault. Is this of any concern for you? If you read copy_from/whatever you will see it performs the read/write and only catches the fault. There is definitely KPI to check whether the target page can be written to without a fault, but you would need to hold locks (mmap_sem and likely more) over your check and whatever you are going to do next, which is likely not what you wanted to do.
So far it seems you are trying

std::copy runtime_error when working with uint16_t's

I'm looking for input as to why this breaks. See the addendum for contextual information, but I don't really think it is relevant.
I have an std::vector<uint16_t> depth_buffer that is initialized to have 640*480 elements. This means that the total space it takes up is 640*480*sizeof(uint16_t) = 614400.
The code that breaks:
void Kinect360::DepthCallback(void* _depth, uint32_t timestamp) {
lock_guard<mutex> depth_data_lock(depth_mutex);
uint16_t* depth = static_cast<uint16_t*>(_depth);
std::copy(depth, depth + depthBufferSize(), depth_buffer.begin());/// the error
new_depth_frame = true;
}
where depthBufferSize() will return 614400 (I've verified this multiple times).
My understanding of std::copy(first, amount, out) is that first specifies the memory address to start copying from, amount is how far in bytes to copy until, and out is the memory address to start copying to.
Of course, it can be done manually with something like
#pragma unroll
for(auto i = 0; i < 640*480; ++i) depth_buffer[i] = depth[i];
instead of the call to std::copy, but I'm really confused as to why std::copy fails here. Any thoughts???
Addendum: the context is that I am writing a derived class that inherits from FreenectDevice to work with a Kinect 360. Officially the error is a Bus Error, but I'm almost certain this is because libfreenect interprets an error in the DepthCallback as a Bus Error. Stepping through with lldb, it's a standard runtime_error being thrown from std::copy. If I manually enter depth + 614400 it will crash, though if I have depth + (640*480) it will chug along. At this stage I am not doing something meaningful with the depth data (rendering the raw depth appropriately with OpenGL is a separate issue xD), so it is hard to tell if everything got copied, or just a portion. That said, I'm almost positive it doesn't grab it all.
Contrasted with the corresponding VideoCallback and the call inside of copy(video, video + videoBufferSize(), video_buffer.begin()), I don't see why the above would crash. If my understanding of std::copy were wrong, this should crash too since videoBufferSize() is going to return 640*480*3*sizeof(uint8_t) = 640*480*3 = 921600. The *3 is from the fact that we have 3 uint8_t's per pixel, RGB (no A). The VideoCallback works swimmingly, as verified with OpenGL (and the fact that it's essentially identical to the samples provided with libfreenect...). FYI none of the samples I have found actually work with the raw depth data directly, all of them colorize the depth and use an std::vector<uint8_t> with RGB channels, which does not suit my needs for this project.
I'm happy to just ignore it and move on in some senses because I can get it to work, but I'm really quite perplexed as to why this breaks. Thanks for any thoughts!
The way std::copy works is that you provide start and end points of your input sequence and the location to begin copying to. The end point that you're providing is off the end of your sequence, because your depthBufferSize function is giving an offset in bytes, rather than the number of elements in your sequence.
If you remove the multiply by sizeof(uint16_t), it will work. At that point, you might also consider calling std::copy_n instead, which takes the number of elements to copy.
Edit: I just realised that I didn't answer the question directly.
Based on my understanding of std::copy, it shouldn't be throwing exceptions with the input you're giving it. The only thing in that code that could throw a runtime_error is the locking of the mutex.
Considering you have undefined behaviour as a result of running off of the end of your buffer, I'm tempted to say that has something to do with it.

Atomic operations in ARM

I've been working on an embedded OS for ARM, However there are a few things i didn't understand about the architecture even after referring to ARMARM and linux source.
Atomic operations.
ARM ARM says that Load and Store instructions are atomic and it's execution is guaranteed to be complete before interrupt handler executes. Verified by looking at
arch/arm/include/asm/atomic.h :
#define atomic_read(v) (*(volatile int *)&(v)->counter)
#define atomic_set(v,i) (((v)->counter) = (i))
However, the problem comes in when i want to manipulate this value atomically using the cpu instructions (atomic_inc, atomic_dec, atomic_cmpxchg etc..) which use LDREX and STREX for ARMv7 (my target).
ARMARM doesn't say anything about interrupts being blocked in this section so i assume an interrupt can occur in between the LDREX and STREX. The thing it does mention is about locking the memory bus which i guess is only helpful for MP systems where there can be more CPUs trying to access same location at same time. But for UP (and possibly MP), If a timer interrupt (or IPI for SMP) fires in this small window of LDREX and STREX, Exception handler executes possibly changes cpu context and returns to the new task, however the shocking part comes in now, it executes 'CLREX' and hence removing any exclusive lock held by previous thread. So how better is using LDREX and STREX than LDR and STR for atomicity on a UP system ?
I did read something about an Exclusive lock monitor, so I've a possible theory that when the thread resumes and executes the STREX, the os monitor causes this call to fail which can be detected and the loop can be re-executed using the new value in the process (branch back to LDREX), Am i right here ?
The idea behind the load-linked/store-exclusive paradigm is that if if the store follows very soon after the load, with no intervening memory operations, and if nothing else has touched the location, the store is likely to succeed, but if something else has touched the location the store is certain to fail. There is no guarantee that stores will not sometimes fail for no apparent reason; if the time between load and store is kept to a minimum, however, and there are no memory accesses between them, a loop like:
do
{
new_value = __LDREXW(dest) + 1;
} while (__STREXW(new_value, dest));
can generally be relied upon to succeed within a few attempts. If computing the new value based on the old value required some significant computation, one should rewrite the loop as:
do
{
old_value = *dest;
new_value = complicated_function(old_value);
} while (CompareAndStore(dest, new_value, old_value) != 0);
... Assuming CompareAndStore is something like:
uint32_t CompareAndStore(uint32_t *dest, uint32_t new_value, uint_32 old_value)
{
do
{
if (__LDREXW(dest) != old_value) return 1; // Failure
} while(__STREXW(new_value, dest);
return 0;
}
This code will have to rerun its main loop if something changes *dest while the new value is being computed, but only the small loop will need to be rerun if the __STREXW fails for some other reason [which is hopefully not too likely, given that there will only be about two instructions between the __LDREXW and __STREXW]
Addendum
An example of a situation where "compute new value based on old" could be complicated would be one where the "values" are effectively a references to a complex data structure. Code may fetch the old reference, derive a new data structure from the old, and then update the reference. This pattern comes up much more often in garbage-collected frameworks than in "bare metal" programming, but there are a variety of ways it can come up even when programming bare metal. Normal malloc/calloc allocators are not generally thread-safe/interrupt-safe, but allocators for fixed-size structures often are. If one has a "pool" of some power-of-two number of data structures (say 255), one could use something like:
#define FOO_POOL_SIZE_SHIFT 8
#define FOO_POOL_SIZE (1 << FOO_POOL_SIZE_SHIFT)
#define FOO_POOL_SIZE_MASK (FOO_POOL_SIZE-1)
void do_update(void)
{
// The foo_pool_alloc() method should return a slot number in the lower bits and
// some sort of counter value in the upper bits so that once some particular
// uint32_t value is returned, that same value will not be returned again unless
// there are at least (UINT_MAX)/(FOO_POOL_SIZE) intervening allocations (to avoid
// the possibility that while one task is performing its update, a second task
// changes the thing to a new one and releases the old one, and a third task gets
// given the newly-freed item and changes the thing to that, such that from the
// point of view of the first task, the thing never changed.)
uint32_t new_thing = foo_pool_alloc();
uint32_t old_thing;
do
{
// Capture old reference
old_thing = foo_current_thing;
// Compute new thing based on old one
update_thing(&foo_pool[new_thing & FOO_POOL_SIZE_MASK],
&foo_pool[old_thing & FOO_POOL_SIZE_MASK);
} while(CompareAndSwap(&foo_current_thing, new_thing, old_thing) != 0);
foo_pool_free(old_thing);
}
If there will not often be multiple threads/interrupts/whatever trying to update the same thing at the same time, this approach should allow updates to be performed safely. If a priority relationship will exist among the things that may try to update the same item, the highest-priority one is guaranteed to succeed on its first attempt, the next-highest-priority one will succeed on any attempt that isn't preempted by the highest-priority one, etc. If one was using locking, the highest-priority task that wanted to perform the update would have to wait for the lower-priority update to finish; using the CompareAndSwap paradigm, the highest-priority task will be unaffected by the lower one (but will cause the lower one to have to do wasted work).
Okay, got the answer from their website.
If a context switch schedules out a process after the process has performed a Load-Exclusive but before it performs the Store-Exclusive, the Store-Exclusive returns a false negative result when the process resumes, and memory is not updated. This does not affect program functionality, because the process can retry the operation immediately.

How do you allocate memory at a predetermined location?

How do i allocate memory using new at a fixed location? My book says to do this:
char *buf=new char[sizeof(sample)];
sample *p=new(buf)sample(10,20);
Here new is allocating memory at buf's address, and (10,20) are values being passed. But what is sample? is it an address or a data type?
let me explain this code to you...
char *buf=new char[sizeof(sample)];
sample *p=new(buf)sample(10,20);
This is really four lines of code, written as two for your convenience. Let me just expand them
char *buf; // 1
buf = new char[sizeof(sample)]; // 2
sample *p; // 3
p = new(buf)sample(10,20); // 4
Line 1 and 3 are simple to explain, they are both declaring pointers. buf is a pointer to a char, p is a pointer to a sample. Now, we can not see what sample is, but we can assume that it is either a class defined else where, or some of data type that has been typedefed (more or less just given a new name) but either way, sample can be thought as a data type just link int or string
Line 2 is a allocating a block of memory and assigning it our char pointer called buf. Lets say sample was a class that contains 2 ints, this means it is (under most compilers) going to be 8 bytes (4 per int). And so buf points to the start of a block of memory that has been set aside to hold chars.
Line 4 is where it gets a big complex. if it where just p = new sample(10,20) it would be a simple case of creating a new object of type sample, passing it the two ints and storing the address of this new object in the pointer p. The addition of the (buf) is basically telling new to make use of the memory pointed to by buf.
The end effect is, you have one block of memory allocated (more or less 8 bytes) and it has two pointers pointing to it. One of the points, buf, is looking at that memory as 8 chars, that other, p, is looking at is a single sample.
Why would you do this?
Normally, you wouldn't. Modern C++ has made the sue of new rather redundant, there are many better ways to deal with objects. I guess that main reason for using this method, is if for some reason you want to keep a pool of memory allocated, as it can take time to get large blocks of memory and you might be able to save your self some time.
For the most part, if you think you need to do something like this, you are trying to solve the wrong thing
A Bit Extra
I do not have much experience with embedded or mobile devices, but I have never seen this used.
The code you posted is basically the same as just doing sample *p = new sample(10,20) neither method is controlling where the sample object is created.
Also consider that you do not always need to create objects dynamically using new.
void myFunction(){
sample p = sample(10,20);
}
This automatically creates a sample object for you. This method is much more preferable as it is easier to read and understand and you do not need to worry about deleting the object, it will be cleaned up for you when the function returns.
If you really do need to make use of dynamic objects, consider use of smart pointers, something like unique_ptr<sample> This will give you the ability to use dynamic object creation but save you the hassle of manual deleting the object of type sample (I can point you towards more info on this if you life)
It is a datatype or a typedef.

Resources