How can I read-modify-write the same variable from multiple GPU threads? In C++AMP I used the standard lib's compare-and-set function, but I haven't found an example in AleaGPU.
I know the goal is to avoid such things, but without getting into much detail I'll say it's pretty necessary for my code.
There is an API in AleaGPU: http://www.aleagpu.com/release/3_0_3/api/html/64c9ca47-2e8e-265b-d968-15345e374320.htm
The usage is described here: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomiccas
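To illustrate the retry pattern that a compare-and-swap primitive enables, here is a minimal host-side sketch in standard C++ using `std::atomic`. It is not AleaGPU code: the helper names `atomic_max` and `run_atomic_max` are made up for this example. On the GPU the same loop would be built around atomicCAS, comparing the value it returns with the value you expected.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Atomically replaces `target` with max(target, candidate) using a
// compare-and-swap retry loop (hypothetical helper for illustration).
void atomic_max(std::atomic<int>& target, int candidate) {
    int observed = target.load();
    // On failure, compare_exchange_weak reloads `observed` with the current
    // value, so every retry works with what another thread just wrote.
    while (observed < candidate &&
           !target.compare_exchange_weak(observed, candidate)) {
        // Nothing to do here: the loop condition performs the retry.
    }
}

// Hypothetical driver: thread i proposes the value i * 10, and the largest
// proposal must win regardless of how the threads interleave.
int run_atomic_max(int thread_count) {
    std::atomic<int> shared{0};
    std::vector<std::thread> workers;
    for (int i = 1; i <= thread_count; ++i) {
        workers.emplace_back([&shared, i] { atomic_max(shared, i * 10); });
    }
    for (auto& w : workers) {
        w.join();
    }
    return shared.load();
}
```

Any read-modify-write operation can be expressed this way: read the current value, compute the new one, and publish it only if nobody changed the variable in the meantime.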
As I understand, MATLAB cannot use pass by reference when sending arguments to other functions. I am doing audio processing, and I frequently have to pass waveforms as arguments into functions, and because MATLAB uses pass by value for these arguments, it really eats up a lot of RAM when I do this.
I was considering using global variables as a method to pass my waveforms into functions, but everywhere I read there seems to be a general opinion that this is a bad idea, for organization of code, and potentially performance issues... but I haven't really read any detailed answers on how this might impact performance...
My question: What are the negative impacts of using global variables (with sizes > 100MB) to pass arguments to other functions in MATLAB, both in terms of 1) performance and 2) general code organization and good practice.
EDIT: From @Justin's answer below, it turns out MATLAB does on occasion use pass by reference when you do not modify the argument within the function! From this, I have a second related question about global variable performance:
Will using global variables be any slower than using pass by reference arguments to functions?
MATLAB does use pass by reference, but also uses copy-on-write. That is to say, your variable will be passed by reference into the function (and so won't double up on RAM), but if you change the variable within the function, then MATLAB will create a copy and change the copy (leaving the original unaffected).
This fact doesn't seem to be too well known, but there's a good post on Loren's blog discussing it.
Bottom line: it sounds like you don't need to use global variables at all (which are a bad idea, as @Adriaan says).
While relying on copy-on-write as Justin suggested is typically the best choice, you can easily implement pass by reference. With MATLAB OOP being nearly as fast as traditional functions in MATLAB 2015b or newer, using a handle class is a reasonable option.
I encountered an interesting use case of a global variable yesterday. I tried to parallelise a piece of code (1200 lines, multiple functions inside the main function, not written by me), using parfor.
Some weird errors came out and it turned out that this piece of code wrote to a log file, but used multiple functions to write to the log file. Rather than opening and closing the relevant log file every time a function wanted to write to it, which is very slow, the file ID was made global, so that all write-functions could access it.
For the serial case this made perfect sense, but when trying to parallelise this, using global apparently breaks the scope of a worker instance as well. So suddenly we had 4 workers all trying to write into the same log file, which resulted in some weird errors.
So all in all, I maintain my position that using global variables is generally a bad idea, although I can see its use in specific cases, provided you know what you're doing.
Using global variables in Matlab may increase performance a lot, because in some cases you can avoid copying data.
Before attempting such performance tweaks, think carefully about the cost to your project in terms of the many drawbacks that global variables come with. There are also pitfalls to using globals that have bad consequences for performance, and those may be difficult (although possible) to avoid. Any code that is littered with globals tends to be difficult to comprehend.
If you want to see globals in use for performance, you can look at this real-time toolbox for optical flow that I made. This is the only project in native Matlab that is capable of real-time optical flow that I know of. Using globals was one of the reasons this was doable. It is also a reason why the code is quite difficult to grasp: globals are evil.
That globals can be used this way is not an argument for their use; rather, it should be a hint that something should be done about Matlab's inflexible notion of workspaces and its inefficient alternatives to globals, such as guidata/getappdata/setappdata.
In OpenMP, when you do not specify any loop iteration policy (in the code pragmas or through environment variable OMP_SCHEDULE), the specs (section 2.3.2) clearly state that the default loop iteration policy is implementation-defined and implementations may or may not expose it.
Is there a workaround to get this policy? To be explicit, I would like to get the value of the internal control variable def-sched-var defined in the specs.
I am using GCC 4.9 with OpenMP 4.0 on a POWER8 architecture.
First of all, I have never seen an implementation whose default type of scheduling was anything other than static, but this doesn't mean that all of them use static as the default.
However, from your comment, I deduce that you want to establish a correlation between the performance of the code and the type of scheduling that is used.
You can run the code with the various types of scheduling (namely static, dynamic and guided). This will tell you how the performance varies as a function of the scheduling policy. Maybe this will tell you something right away, but I would also try other things, such as looking at every parallel loop and measuring its performance. Post the main loops so we can tell you if there is something else going on.
Put simply, I doubt that changing the type of scheduling will fix bad performance that easily.
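A minimal sketch of how such an experiment could be set up, assuming a C++ compiler with OpenMP enabled (e.g. `-fopenmp` with GCC). The function names and the loop body are made up for illustration. Note that `omp_get_schedule` reports run-sched-var, the schedule applied to `schedule(runtime)` loops, and not def-sched-var; as far as I know OpenMP 4.0 offers no routine that returns the latter.

```cpp
#include <string>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Returns the name of the schedule that schedule(runtime) loops will use
// (the run-sched-var ICV), or "no-openmp" when built without OpenMP.
std::string runtime_schedule_name() {
#ifdef _OPENMP
    omp_sched_t kind;
    int chunk = 0;
    omp_get_schedule(&kind, &chunk);
    switch (kind) {
        case omp_sched_static:  return "static";
        case omp_sched_dynamic: return "dynamic";
        case omp_sched_guided:  return "guided";
        case omp_sched_auto:    return "auto";
        default:                return "implementation-specific";
    }
#else
    return "no-openmp";
#endif
}

// Hypothetical stand-in for one of the main loops. schedule(runtime) defers
// the choice of policy to the OMP_SCHEDULE environment variable, so the same
// binary can be timed under static, dynamic and guided scheduling.
double sum_of_squares(const std::vector<double>& values) {
    double total = 0.0;
    const long n = static_cast<long>(values.size());
    #pragma omp parallel for schedule(runtime) reduction(+ : total)
    for (long i = 0; i < n; ++i) {
        total += values[i] * values[i];
    }
    return total;
}
```

Running the program with `OMP_SCHEDULE=static`, `OMP_SCHEDULE=dynamic` and `OMP_SCHEDULE=guided` in turn, and timing each loop, gives the comparison described above without recompiling.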
I was reading code from one of the projects on GitHub and came across something called a Vectored Referencing buffer implementation. Has anyone come across this? What are its practical applications? I did a quick Google search and wasn't able to find any simple sample implementation of it.
Some insight would be helpful.
http://www.ibm.com/developerworks/library/j-zerocopy/
http://www.linuxjournal.com/article/6345
http://www.seccuris.com/documents/whitepapers/20070517-devsummit-zerocopybpf.pdf
https://github.com/joyent/node/pull/304
I think some more insight on your specific project/usage/etc would allow for a more specific answer.
However, the term is generally used to either change or start an interface/function/routine with the goal that it does not allocate another instance of its input in order to perform its operations.
EDIT: Ok, after reading the new title, I think you are simply talking about pushing buffers into a vector of buffers. This keeps your code clean, you can pass any buffer you need with minimal overhead to any function call, and allows for a better cleanup time if your code isn't managed.
EDIT 2: Do you mean this http://cpansearch.perl.org/src/TYPESTER/Data-MessagePack-Stream-0.07/msgpack-0.5.7/src/msgpack/vrefbuffer.h
Why do we need boost::thread_specific_ptr, or in other words what can we not easily do without it?
I can see why pthread provides pthread_getspecific() etc. These functions are useful for cleaning up after dead threads, and handy to call from C-style functions (the obvious alternative being to pass a pointer everywhere that points to some memory allocated before the thread was created).
In contrast, the constructor of boost::thread takes a callable class by value, and everything non-static in that class becomes thread local once it is copied. I cannot see why I would want to use boost::thread_specific_ptr in preference to a class member any more than I would want to use a global variable in OOP code.
Do I horribly misunderstand anything? A very brief example would help, please. Many thanks.
thread_specific_ptr simply provides portable thread local data access. You don't have to be managing your threads with Boost.Thread to get value from this. The canonical example is the one cited in the Boost docs for this class:
"One example is the C errno variable, used for storing the error code related to functions from the Standard C library. It is common practice (and required by POSIX) for compilers that support multi-threaded applications to provide a separate instance of errno for each thread, in order to avoid different threads competing to read or update the value."
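A very brief sketch of the errno-style situation, written with the standard C++11 `thread_local` keyword rather than boost::thread_specific_ptr so that it has no dependencies. That is a substitution on my part: thread_specific_ptr covers the same need on compilers without `thread_local`, and it also runs a cleanup function when the thread exits. The names `last_error` and `risky_operation` are made up for illustration.

```cpp
#include <thread>

// Each thread gets its own copy of this variable, just like errno.
thread_local int last_error = 0;

// Hypothetical C-style function that reports failure through the
// thread-local variable instead of a return value or an extra parameter.
void risky_operation(int code) {
    last_error = code;
}

// Lets a worker thread fail, then returns what the calling thread sees.
int error_seen_by_caller_after_worker_fails() {
    last_error = 0;
    std::thread worker([] { risky_operation(42); });
    worker.join();
    // The worker wrote to its own copy, so the caller's copy is unchanged.
    return last_error;
}
```

The point is that `risky_operation` needs no pointer to per-thread state and does not care how the thread was created, which is what a class member of the callable passed to boost::thread cannot give you.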
Are there any languages that allow declaring global assertions, that is, assertions that should hold during the whole program execution? Then it would be possible to write something like:
global assert (-10 < speed < 10);
and this assertion will be checked every time speed changes state?
Eiffel supports all the different contracts: precondition, postcondition, invariant... You may want to use that.
On the other hand, why do you have a global variable? Why don't you create a class which modifies the speed? Doing so, you can easily check your condition every time the value changes.
I'm not aware of any languages that truly do such a thing, and I would doubt that there exist any since it is something that is rather hard to implement and at the same time not something that a lot of people need.
It is often better to simply assert that the inputs are valid and that modifications are only done when allowed and in a defined, sane way. This removes the need for "global asserts".
You can get this effect "through the backdoor" in several ways, though none is truly elegant, and two are rather system-dependent:
If your language allows operator overloading (e.g. C++), you can make a class that overloads every operator which modifies the value. It is considerable, though trivial, work to do the assertions in there.
On pretty much every system, you can change the protection of memory pages that belong to your process. You could put the variable (and any other variables that you want to assert) on a separate page and set the page to read-only. This will cause a segmentation fault when the value is written to, which you can catch (and then verify that the assertion is true). Windows even makes this explicitly available via "guard pages" (which are really only "read-only pages in disguise").
Most modern processors support hardware breakpoints. Unless your program is to run on some very exotic platform, you can exploit these to have more fine-grained control in a similar way as by tampering with protections. See for example this article on another site, which describes how to do it under Windows on x86. This solution will require you to write a kind of "mini-debugger" and implies that you may possibly run into trouble when running your program under a real debugger.
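The first of these options, the operator-overloading wrapper, could look roughly like the sketch below. The class name `BoundedSpeed` is hypothetical, and it throws an exception instead of calling `assert` so that a violation can be observed and handled; only a few modifying operators are shown.

```cpp
#include <stdexcept>

// Hypothetical wrapper: every write goes through check(), so the invariant
// -10 < speed < 10 is verified each time the value changes.
class BoundedSpeed {
public:
    explicit BoundedSpeed(double initial) : value_(check(initial)) {}

    BoundedSpeed& operator=(double v) {
        value_ = check(v);
        return *this;
    }
    BoundedSpeed& operator+=(double delta) {
        value_ = check(value_ + delta);
        return *this;
    }
    BoundedSpeed& operator-=(double delta) {
        value_ = check(value_ - delta);
        return *this;
    }

    double value() const { return value_; }

private:
    // Returns v unchanged if it satisfies the invariant, throws otherwise.
    // Because the check runs before the assignment, a rejected write leaves
    // the previous value in place.
    static double check(double v) {
        if (!(v > -10.0 && v < 10.0)) {
            throw std::out_of_range("speed outside (-10, 10)");
        }
        return v;
    }

    double value_;
};
```

With this in place, `BoundedSpeed speed(0.0); speed += 3.0;` works as usual, while `speed = 12.0;` throws and the stored value stays at 3.0.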