thrust: sorting within a threadblock

thrust: sorting within a threadblock - sorting

I am dispatching a kernel with around 5k blocks. At some point, we need to sort an array within each threadblock. If possible we would like to use a library like thrust.
From the documentation I understand that how sort is executed in thrust depends on the specified execution_policy. However I don't understand if I can use execution_policies to specify that I would like to use the threads of my current block for sorting. Can someone explain or hint me towards a good documentation of execution policies and tell me if what I intend to do is feasible?

Turns out that execution policies are basically a bridge design pattern that uses template specialization instead of inheritance to select the appropriate implementation of an algorithm while exposing a stable interface towards the user of the library and avoiding the overhead/necessity of virtual functions. Thank you robert-crovella for the great video.
As for the actual implementation of sorting within a threadblock in thrust, talonmies is right. There simply is no implementation (currently?), I could not find anything in the source code.

Related

Serializing a std::function for a distributed population-based optimization application in C++

Hopefully the title of this question conveys the general concept that I am looking for assistance in designing/implementing.
To clarify:
I am trying to write a distributed application to perform optimization. The general approach will follow that of a genetic algorithm: I am trying to find a vector of parameters which minimizes the value of some cost function.
My present difficulty deals with the cost function itself. I am trying to make the design as generic as I possibly can; therefore, my intent is to use a std::function as an argument to the optimizer class.
Specifically, my question centers around: how can I share information about this cost function to the distributed nodes?
A couple more details:
The computation nodes are intended to be fully decoupled. This type of algorithm is embarrassingly parallel and requires no communication between each of the nodes.
It has been my intention, therefore, not to use MPI (or related message-passing approaches), but rather to use a scheduler such as Condor or SGE (most likely the former) to spawn instances of the application on available nodes. The nodes will be given a lookup key for retrieving its candidate vector from a database.
I can use the Factory Pattern to instantiate objects of a given cost function type; however, I am looking for an alternate approach. (I'm not sure what it is about using a Factory class that I find disagreeable)
I am obviously not looking for a way to execute arbitrary code on a remote machine. Each node runs a full instance of the executable.
As an aside, I am making fairly extensive use of Boost throughout and am highly amenable to an alternative solution using Boost functionality.
I'd appreciate if anyone can offer some advice on how to distribute the cost function to other nodes. As well, I'd also greatly appreciate if anyone can point out whether there is something flawed with my general design concept.
Update:
Valid or not, one of my hesitations regarding a factory pattern for the cost function, involves the configuration/parameters/details of the cost function. There are certain assumptions and/or specs that I can assign to this family of classes: the objects that it is expecting to operate on, etc. I am already passing various options to other Factory classes elsewhere in the design and I find it adds to the complexity.
Perhaps a bit of additional clarification on the context is warranted:
I am using this optimizer as a module in an app for the purpose of modeling data. The details of the model are, in fact, parsed into an AST and passed to a Factory class which passes arguments to the applicable model class(es).
Therefore, the optimizer has a reference to the model itself, can call a Calculate(Iterator&) method on the model and expect a Matrix as a return value (fyi, the Iterator is a reference to the parameter Vector).
That part is straightforward. What I find to be less straightforward is then determining what to do with the returned matrix. Certainly, a simple cost function might simply be a least-squares. But I can conceive of relatively complex functions as well which may themselves be parameterized. I can deal with that, as well, but I was hoping that there might be a completely different approach -- one that I haven't thought of, which might help to simplify what I am doing.
Thanks and kind regards,
Shmuel

Is it possible to build a thread-safe linked list that scales and provides read, insert and remove?

I've been looking for implementations of scalable linked-lists that can be used safely my multiple threads in parallel, that can read, insert and remove elements from it.
However, I was only able to found papers and theoretical descriptions of them, no libraries are available.
First of all, it is mandatory to use intrinsics or can I do this with higher-level atomics (e.g. OpenMP atomic)? Will that hurt scalability?
In second place, implementations usually provide insert and remove functions. The only way to traverse the list in another thread and at the same time is me to implement it, right?
Thanks

"GLOBAL could be very inefficient"

I am using (in Matlab) a global statement inside an if command, so that I import the global variable into the local namespace only if it is really needed.
The code analyzer warns me that "global could be very inefficient unless it is a top-level statement in its function". Thinking about possible internal implementation, I find this restriction very strange and unusual. I am thinking about two possibilities:
What this warning really means is "global is very inefficient of its own, so don't use it in a loop". In particular, using it inside an if, like I'm doing, is perfectly safe, and the warning is issued wrongly (and poorly worded)
The warning is correct; Matlab uses some really unusual variable loading mechanism in the background, so it is really much slower to import global variables inside an if statement. In this case, I'd like to have a hint or a pointer to how this stuff really works, because I am interested and it seems to be important if I want to write efficient code in future.
Which one of these two explanations is correct? (or maybe neither is?)
Thanks in advance.
EDIT: to make it clearer: I know that global is slow (and apparently I can't avoid using it, as it is a design decision of an old library I am using); what I am asking is why the Matlab code analyzer complains about
if(foo==bar)
GLOBAL baz
baz=1;
else
do_other_stuff;
end
but not about
GLOBAL baz
if(foo==bar)
baz=1;
else
do_other_stuff;
end
I find it difficult to imagine a reason why the first should be slower than the second.

To supplement eykanals post, this technical note gives an explanation to why global is slow.
... when a function call involves global variables, performance is even more inhibited. This is because to look for global variables, MATLAB has to expand its search space to the outside of the current workspace. Furthermore, the reason a function call involving global variables appears a lot slower than the others is that MATLAB Accelerator does not optimize such a function call.

I do not know the answer, but I strongly suspect this has to do with how memory is allocated and shared at runtime.
Be that as it may, I recommend reading the following two entries on the Mathworks blogs by Loren and Doug:
Writing deployable code, the very first thing he writes in that post
Top 10 MATLAB code practices that make me cry, #2 on that list.
Long story short, global variables are almost never the way to go; there are many other ways to accomplish variable sharing - some of which she discusses - which are more efficient and less error-prone.

The answer from Walter Roberson here
http://mathworks.com/matlabcentral/answers/19316-global-could-be-very-inefficient#answer_25760
[...] This is not necessarily more work if not done in a top-level command, but people would tend to put the construct in a loop, or in multiple non-exclusive places in conditional structures. It is a lot easier for a person writing mlint warnings to not have to add clarifications like, "Unless you can prove those "global" will only be executed once, in which case it isn't less efficient but it is still bad form"
supports my option (1).

Fact(from Matlab 2014 up until Matlab 2016a, and not using parallell toolbox): often, the fastest code you can achieve with Matlab is by doing nested functions, sharing your variables between functions without passing them.
The step close to that, is using global variables, and splitting your project up into multiple files. This may pull down performance slightly, because (supposedly, although I have never seen it verified in any tests) Matlab incurs overhead by retrieving from the global workspace, and because there is some kind of problem (supposedly, although never seen any evidence of it) with the JIT acceleration.
Through my own testing, passing very large data matrices (hi-res images) between calls to functions, using nested functions or global variables are almost identical in performance.
The reason that you can get superior performance with global variables or nested functions, is because you can avoid having extra data copying that way. If you send a variable to function, Matlab does so by reference, but if you modify the variable in the function, Matlab makes a copy on the fly (copy-on-write). There is no way I know of to avoid that in Matlab, except by nested functions and global variables. Any small drain you get from hinderance to JIT or global fetch times, is totally gained by avoiding this extra data copying, (when using larger data).
This may have changed with never versions of Matlab, but from what i hear from friends, I doubt it. I cant submit any test, dont have a Matlab license anymore.
As proof, look no further then this toolbox of video processing i made back in the day I was working with Matlab. It is horribly ugly under the hood, because I had no way of getting performance without globals.
This fact about Matlab (that global variables is the most optimized way you can code when you need to modify large data in different functions), is an indication that the language and/or interpreter needs to be updated.
Instead, Matlab could use a better, more dynamic notion of workspace. But nothing I have seen indicates this will ever happen. Especially when you see the community of users seemingly ignore the facts, and push forward oppions without any basis: such as using globals in Matlab are slow.
They are not.
That said, you shouldnt use globals, ever. If you are forced to do real time video processing in pure Matlab, and you find you have no other option then using globals to reach performance, you should get the hint and change language. Its time to get into higher performance languages.... and also maybe write an occasional rant on stack overflow, in hopes that Matlab can get improved by swaying the oppinions of its users.

Reimplementing data structures in the real world

The topic of algorithms class today was reimplementing data structures, specifically ArrayList in Java. The fact that you can customize a structure for in various ways definitely got me interested, particularly with variations of add() & iterator.remove() methods.
But is reimplementing and customizing a data structure something that is of more interest to the academics vs the real-world programmers? Has anyone reimplemented their own version of a data structure in a commercial application/program, and why did you pick that route over your particular language's implementation?

Knowing how data structures are implemented and can be implemented is definitely of interest to everyone, not just academics. While you will most likely not reimplement a datastructure if the language already provides an implementation with suitable functions and performance characteristics, it is very possible that you will have to create your own data structure by composing other data structures... or you may need to implement a data structure with slightly different behavior than a well-known data structure. In that case, you certainly will need to know how the original data structure is implemented. Alternatively, you may end up needing a data structure that does not exist or which provides similar behavior to an existing data structure, but the way in which it is used requires that it be optimized for a different set of functions. Again, such a situation would require you to know how to implement (and alter) the data structure, so yes it is of interest.
Edit
I am not advocating that you reimplement existing datastructures! Don't do that. What I'm saying is that the knowledge does have practical application. For example, you may need to create a bidirectional map data structure (which you can implement by composing two unidirectional map data structures), or you may need to create a stack that keeps track of a variety of statistics (such as min, max, mean) by using an existing stack data structure with an element type that contains the value as well as these various statistics. These are some trivial examples of things that you might need to implement in the real world.

I have re-implemented some of a language's built-in data structures, functions, and classes on a number of occasions. As an embedded developer, the main reason I would do that is for speed or efficiency. The standard libraries and types were designed to be useful in a variety of situations, but there are many instances where I can create a more specialized version that is custom-tailored to take advantage of the features and limitations of my current platform. If the language doesn't provide a way to open up and modify existing classes (like you can in Ruby, for instance), then re-implementing the class/function/structure can be the only way to go.
For example, one system I worked on used a MIPS CPU that was speedy when working with 32-bit numbers but slower when working with smaller ones. I re-wrote several data structures and functions to use 32-bit integers instead of 16-bit integers, and also specified that the fields be aligned to 32-bit boundaries. The result was a noticable speed boost in a section of code that was bottlenecking other parts of the software.
That being said, it was not a trivial process. I ended up having to modify every function that used that structure and I ended up having to re-write several standard library functions as well. In this particular instance, the benefits outweighed the work. In the general case, however, it's usually not worth the trouble. There's a big potential for hard-to-debug problems, and it's almost always more work than it looks like. Unless you have specific requirements or restrictions that the existing structures/classes don't meet, I would recommend against re-implementing them.
As Michael mentions, it is indeed useful to know how to re-implement structures even if you never do so. You may find a problem in the future that can be solved by applying the principles and techniques used in existing data structures.

Does coding towards an interface rather then an implementation imply a performance hit?

In day to day programs I wouldn't even bother thinking about the possible performance hit for coding against interfaces rather than implementations. The advantages largely outweigh the cost. So please no generic advice on good OOP.
Nevertheless in this post, the designer of the XNA (game) platform gives as his main argument to not have designed his framework's core classes against an interface that it would imply a performance hit. Seeing it is in the context of a game development where every fps possibly counts, I think it is a valid question to ask yourself.
Does anybody have any stats on that? I don't see a good way to test/measure this as don't know what implications I should bear in mind with such a game (graphics) object.

Coding to an interface is always going to be easier, simply because interfaces, if done right, are much simpler. Its palpably easier to write a correct program using an interface.
And as the old maxim goes, its easier to make a correct program run fast than to make a fast program run correctly.
So program to the interface, get everything working and then do some profiling to help you meet whatever performance requirements you may have.

What Things Cost in Managed Code
"There does not appear to be a significant difference in the raw cost of a static call, instance call, virtual call, or interface call."
It depends on how much of your code gets inlined or not at compile time, which can increase performance ~5x.
It also takes longer to code to interfaces, because you have to code the contract(interface) and then the concrete implementation.
But doing things the right way always takes longer.

First I'd say that the common conception is that programmers time is usually more important, and working against implementation will probably force much more work when the implementation changes.
Second with proper compiler/Jit I would assume that working with interface takes a ridiculously small amount of extra time compared to working against the implementation itself.
Moreover, techniques like templates can remove the interface code from running.
Third to quote Knuth : "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil."
So I'd suggest coding well first, and only if you are sure that there is a problem with the Interface, only then I would consider changing.
Also I would assume that if this performance hit was true, most games wouldn't have used an OOP approach with C++, but this is not the case, this Article elaborates a bit about it.
It's hard to talk about tests in a general form, naturally a bad program may spend a lot of time on bad interfaces, but I doubt if this is true for all programs, so you really should look at each particular program.

Interfaces generally imply a few hits to performance (this however may change depending on the language/runtime used):
Interface methods are usually implemented via a virtual call by the compiler. As another user points out, these can not be inlined by the compiler so you lose that potential gain. Additionally, they add a few instructions (jumps and memory access) at a minimum to get the proper PC in the code segment.
Interfaces, in a number of languages, also imply a graph and require a DAG (directed acyclic graph) to properly manage memory. In various languages/runtimes you can actually get a memory 'leak' in the managed environment by having a cyclic graph. This imposes great stress (obviously) on the garbage collector/memory in the system. Watch out for cyclic graphs!
Some languages use a COM style interface as their underlying interface, automatically calling AddRef/Release whenever the interface is assigned to a local, or passed by value to a function (used for life cycle management). These AddRef/Release calls can add up and be quite costly. Some languages have accounted for this and may allow you to pass an interface as 'const' which will not generate the AddRef/Release pair automatically cutting down on these calls.
Here is a small example of a cyclic graph where 2 interfaces reference each other and neither will automatically be collected as their refcounts will always be greater than 1.
interface Parent {
Child c;
}
interface Child {
Parent p;
}
function createGraph() {
...
Parent p = ParentFactory::CreateParent();
Child c = ChildFactory::CreateChild();
p.c = c;
c.p = p;
... // do stuff here
// p has a reference to c and c has a reference to p.
// When the function goes out of scope and attempts to clean up the locals
// it will note that p has a refcount of 1 and c has a refcount of 1 so neither
// can be cleaned up (of course, this is depending on the language/runtime and
// if DAGS are allowed for interfaces). If you were to set c.p = null or
// p.c = null then the 2 interfaces will be released when the scope is cleaned up.
}

I think object lifetime and the number of instances you're creating will provide a coarse-grain answer.
If you're talking about something which will have thousands of instances, with short lifetimes, I would guess that's probably better done with a struct rather than a class, let alone a class implementing an interface.
For something more component-like, with low numbers of instances and moderate-to-long lifetime, I can't imagine it's going to make much difference.

IMO yes, but for a fundamental design reason far more subtle and complex than virtual dispatch or COM-like interface queries or object metadata required for runtime type information or anything like that. There is overhead associated with all of that but it depends a lot on the language and compiler(s) used, and also depends on whether the optimizer can eliminate such overhead at compile-time or link-time. Yet in my opinion there's a broader conceptual reason why coding to an interface implies (not guarantees) a performance hit:
Coding to an interface implies that there is a barrier between you and
the concrete data/memory you want to access and transform.
This is the primary reason I see. As a very simple example, let's say you have an abstract image interface. It fully abstracts away its concrete details like its pixel format. The problem here is that often the most efficient image operations need those concrete details. We can't implement our custom image filter with efficient SIMD instructions, for example, if we had to getPixel one at a time and setPixel one at a time and while oblivious to the underlying pixel format.
Of course the abstract image could try to provide all these operations, and those operations could be implemented very efficiently since they have access to the private, internal details of the concrete image which implements that interface, but that only holds up as long as the image interface provides everything the client would ever want to do with an image.
Often at some point an interface cannot hope to provide every function imaginable to the entire world, and so such interfaces, when faced with performance-critical concerns while simultaneously needing to fulfill a wide range of needs, will often leak their concrete details. The abstract image might still provide, say, a pointer to its underlying pixels through a pixels() method which largely defeats a lot of the purpose of coding to an interface, but often becomes a necessity in the most performance-critical areas.
Just in general a lot of the most efficient code often has to be written against very concrete details at some level, like code written specifically for single-precision floating-point, code written specifically for 32-bit RGBA images, code written specifically for GPU, specifically for AVX-512, specifically for mobile hardware, etc. So there's a fundamental barrier, at least with the tools we have so far, where we cannot abstract that all away and just code to an interface without an implied penalty.
Of course our lives would become so much easier if we could just write code, oblivious to all such concrete details like whether we're dealing with 32-bit SPFP or 64-bit DPFP, whether we're writing shaders on a limited mobile device or a high-end desktop, and have all of it be the most competitively efficient code out there. But we're far from that stage. Our current tools still often require us to write our performance-critical code against concrete details.
And lastly this is kind of an issue of granularity. Naturally if we have to work with things on a pixel-by-pixel basis, then any attempts to abstract away concrete details of a pixel could lead to a major performance penalty. But if we're expressing things at the image level like, "alpha blend these two images together", that could be a very negligible cost even if there's virtual dispatch overhead and so forth. So as we work towards higher-level code, often any implied performance penalty of coding to an interface diminishes to a point of becoming completely trivial. But there's always that need for the low-level code which does do things like process things on a pixel-by-pixel basis, looping through millions of them many times per frame, and there the cost of coding to an interface can carry a pretty substantial penalty, if only because it's hiding the concrete details necessary to write the most efficient implementation.

In my personal opinion, all the really heavy lifting when it comes to graphics is passed on to the GPU anwyay. These frees up your CPU to do other things like program flow and logic. I am not sure if there is a performance hit when programming to an interface but thinking about the nature of games, they are not something that needs to be extendable. Maybe certain classes but on the whole I wouldn't think that a game needs to programmed with extensibility in mind. So go ahead, code the implementation.

it would imply a performance hit
The designer should be able to prove his opinion.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio