I am looking into the documentation of Spliterator, and according to it, a Spliterator is not thread-safe:
Despite their obvious utility in parallel algorithms, spliterators are not expected to be thread-safe; instead, implementations of parallel algorithms using spliterators should ensure that the spliterator is only used by one thread at a time. This is generally easy to attain via serial thread-confinement, which often is a natural consequence of typical parallel algorithms that work by recursive decomposition.
But further down, the documentation seems to contradict the statement above:
Structural interference of a source can be managed in the following ways (in approximate order of decreasing desirability):
The source manages concurrent modifications.
For example, a key set of a java.util.concurrent.ConcurrentHashMap is a concurrent source. A Spliterator created from the source reports a characteristic of CONCURRENT.
So does that mean a Spliterator created from a thread-safe collection is itself thread-safe? Is that right?
No, a Spliterator reporting the CONCURRENT characteristic has a thread-safe source, which implies that it can iterate over that source safely even while the source is being modified concurrently. But the Spliterator itself may still have state that must not be manipulated concurrently.
Note that your quote stems from a description of how “structural interference of a source can be managed”, not from a description of the spliterator’s behavior in general.
This is also provided at the documentation of the CONCURRENT characteristic itself:
Characteristic value signifying that the element source may be safely concurrently modified (allowing additions, replacements, and/or removals) by multiple threads without external synchronization. If so, the Spliterator is expected to have a documented policy concerning the impact of modifications during traversal.
Nothing else.
So the consequences of these characteristics are astonishingly small. A Spliterator reporting either CONCURRENT or IMMUTABLE will never throw a ConcurrentModificationException; that’s all. In all other regards, the differences between these characteristics are not recognized by the Stream API, as the Stream API never performs any source manipulations. In fact, it doesn’t actually know the source (other than indirectly through the Spliterator), so it could neither perform such manipulations nor detect whether a concurrent modification has happened.
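A minimal sketch of the distinction, using a ConcurrentHashMap key set as the source: the Spliterator reports CONCURRENT because the source is concurrent, yet the Spliterator instance itself still holds traversal state and must be confined to one thread at a time.

    import java.util.Spliterator;
    import java.util.concurrent.ConcurrentHashMap;

    public class ConcurrentCharacteristicDemo {
        public static void main(String[] args) {
            ConcurrentHashMap<String, Integer> map = new ConcurrentHashMap<>();
            map.put("a", 1);
            map.put("b", 2);

            Spliterator<String> s = map.keySet().spliterator();

            // The source manages concurrent modification, so CONCURRENT is reported...
            System.out.println(s.hasCharacteristics(Spliterator.CONCURRENT)); // true

            // ...but this Spliterator instance still has internal traversal state,
            // so it must only be used by one thread at a time.
            s.tryAdvance(System.out::println);
        }
    }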
As we know, work items running on GPUs can diverge when there are conditional branches. One such mention appears in Apple's OpenCL Programming Guide for Mac.
As such, some portions of an algorithm may run "single-threaded", with only one work item running. And when such a portion is especially serial and long-running, some applications move that work back to the CPU.
However, this question concerns only the GPU and assumes those portions are short-lived. Do these "single-threaded" portions also diverge (as in, execute both the true and false code paths) when they have conditional branches? Or will the compute units (or processing elements, whichever term you prefer) skip those false branches?
Update
In reply to the comment, I'll remove the OpenCL tag and leave the Vulkan tag there.
I included OpenCL because I wanted to know if there's any difference at all between clEnqueueTask and clEnqueueNDRangeKernel with dim=1:x=1. The documentation says they're equivalent, but I was skeptical.
I believe Vulkan removed the special function to enqueue a single-threaded task for good reasons, and if I'm wrong, please correct me.
Do these "single-threaded" portions also diverge (as in execute both true and false code paths) when they have conditional branches?
From an API point of view it has to appear to the program that only the active branch paths were taken. As to what actually happens, I suspect you'll never know for sure. GPU hardware architectures are nearly all confidential so it's impossible to be certain.
There are really two cases here:
Cases where a branch in the program turns into a real branch instruction.
Cases where a branch in the program turns into a conditional select between two computed values.
In the case of a real branch, I would expect most implementations to execute only the active path, because it's a horrible waste of power to do both, and GPUs are all about energy efficiency. That said, YMMV and this isn't guaranteed at all.
For simple branches the compiler might choose to use a conditional select (compute both results, then select the right answer). In this case you will compute both results. The compiler heuristics will generally aim to choose this where computing both results is less expensive than a full branch.
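As a conceptual sketch, in Java purely for illustration (the actual lowering happens inside the GPU compiler, not at this level), these are the two shapes a compiler chooses between:

    // Branch form: only the taken path executes (a real branch instruction).
    static int branchForm(boolean cond, int a, int b) {
        if (cond) {
            return a * a;   // runs only when cond is true
        } else {
            return b * b;   // runs only when cond is false
        }
    }

    // Select form: both sides are computed unconditionally, then one result
    // is picked; on SIMD hardware this avoids the branch entirely.
    static int selectForm(boolean cond, int a, int b) {
        int whenTrue  = a * a;
        int whenFalse = b * b;
        return cond ? whenTrue : whenFalse;
    }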
I included OpenCL as I wanted to know if there's any difference at all between clEnqueueTask and clEnqueueNDRangeKernel with dim=1:x=1. The document says they're equivalent but I was skeptical.
Why would they be different? They are doing the same thing conceptually ...
I believe Vulkan removed the special function to enqueue a single-threaded task for good reasons, and if I'm wrong, please correct me.
Vulkan compute dispatch is in general a whole load simpler than OpenCL (and also perfectly adequate for most use cases), so many of the host-side functions from OpenCL have no equivalent in Vulkan. The GPU side behavior is pretty much the same. It's also worth noting that most of the holes where Vulkan shaders are missing features compared to OpenCL are being patched up with extensions - e.g. VK_KHR_shader_float16_int8 and VK_KHR_variable_pointers.
Q : Or will the compute units skip those false branches?
The ecosystem of CPU / GPU code-execution is rather complex.
The layer of hardware is where the code paths (translated into "machine" code) operate. On this layer, the SIMD computing units cannot and will not skip anything they are ordered to SIMD-process by the hardware scheduler (the next layer).
The layer of the hardware-specific scheduler (GPUs typically have just two modes: WARP-mode scheduling, where coherent, non-diverging code paths are efficiently scheduled in SIMD blocks, and greedy-mode scheduling). From this layer, the SIMD computing units are loaded with SIMD-operated blocks of work; any divergence first detected on the lower layer (above) breaks the execution, flags the SIMD hardware scheduler that some blocks were deferred to be executed later, and the SIMD-specific, block-device-optimised scheduling is well known to grow less and less efficient with each such run-time divergence.
The layer of { OpenCL | Vulkan API }-mediated, device-specific programming decides a lot about the ease or comfort of programming the wide range of target devices, all without the programmer knowing each device's internal constraints, the (compiler-decided) preferred "machine"-code reformulation of the computing problem, or the device-specific tricks and scheduling. In this slightly oversimplified picture, human users have for years simply stayed "in front" of the mediated, asynchronous HOST-to-DEVICE scheduling queues of work units (kernels) and waited until the DEVICE-to-HOST results were delivered back, doing some prior-H2D/posterior-D2H memory transfers, if allowed and needed.
The HOST-side DEVICE-kernel-code "scheduling" directives are rather imperative and help the mediated, device-specific programming reflect user-side preferences, yet they leave the user blind to all the internal decisions (assembly-level reviews are indeed only for hard-core, device-specific GPU-engineering aces, and are hard to modify even for those willing to try).
All that said, "adaptive" decisions based on run-time values to move a particular "work unit" back to the HOST CPU, rather than finalising it all in the DEVICE GPU, do not, to the best of my knowledge, take place at the bottom of this complex computing-ecosystem hierarchy (AFAIK, it would be prohibitively expensive to try to do so).
A year ago I programmed a chess AI using the alpha-beta pruning algorithm. This was relatively straightforward to do in C++. One of the main issues I considered while doing this was making my code efficient. I did this by having a data type I called a "game" that I passed around through the search tree made by the algorithm. To increase efficiency I never copied the "game" data type, but rather mutated it while keeping the information necessary to return it to its previous states.
Recently I have been reading about functional programming, and the concept of using only pure functions that do not change the state of the parameters they are passed appeals to me. I am wondering how I would use the paradigm of functional programming while still taking the efficiency of the program into account.
In OOP the solution seems quite straightforward (which is what I implemented), while in functional programming it seems that copying data types is necessary, which decreases efficiency. Is it possible to use functional programming without this loss of efficiency?
In functional programming, data structures are not always copied completely. In many cases, only the part that changes needs to be copied, while the old parts can be referenced (since no mutation is allowed, this is safe).
The article on persistent data structures describes this in more detail.
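As a minimal sketch of the idea (the class name PList is mine, not from any library): prepending to an immutable linked list allocates exactly one new node and shares the entire old list as the tail, so nothing is copied.

    // A tiny persistent (immutable) singly linked list.
    final class PList<T> {
        final T head;
        final PList<T> tail;   // shared with the old list, never mutated

        PList(T head, PList<T> tail) {
            this.head = head;
            this.tail = tail;
        }

        // "Adding" allocates one node; the old list remains valid and unchanged.
        PList<T> cons(T value) {
            return new PList<>(value, this);
        }
    }

Since neither version can ever change, sharing the tail between the old and the new list is safe.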
Jephron's answer points out the important fact that only small parts of a persistent data structure need to be updated, so the bigger part is shared between the old state and the new state.
To be honest, this would still be slower than a mutation in most cases.
But immutable, persistent data structures have other advantages. Let's assume you have already completed the playing engine, and now you want to implement a history (for example, to allow the player to undo earlier moves). This is dead simple: just remember all states in a list. You'll find that you need to touch only a few functions to take a list of states instead of just the last state, and you're done. You don't need to worry about compromising your game engine: there is no global variable or anything else you could destroy.
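A sketch of that history, assuming a hypothetical immutable GameState with an applyMove method (both names are placeholders for whatever your engine defines):

    import java.util.ArrayDeque;
    import java.util.Deque;

    // GameState and Move are hypothetical immutable types from your engine.
    final class Game {
        private final Deque<GameState> history = new ArrayDeque<>();

        Game(GameState initial) {
            history.push(initial);
        }

        void play(Move move) {
            // Old states stay intact, so pushing the new one is all we need.
            history.push(history.peek().applyMove(move));
        }

        void undo() {
            if (history.size() > 1) {
                history.pop();   // undo is just dropping the newest state
            }
        }

        GameState current() {
            return history.peek();
        }
    }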
Another thing is taking advantage of the many CPU cores you probably have by employing parallelism. Needless to say, you can't let many tasks, threads, fibers or whatever operate on a single mutable data structure. That would just become a synchronization nightmare, and your code would probably even get slower. However, there simply are no synchronization problems on immutable data, as it is read-only for all threads.
This could very well speed up your code so much that it dwarfs the C++ solution, even if "doing a move" on a functional data structure is much slower than on mutable data.
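For instance (again with hypothetical GameState/Move/evaluate names), scoring candidate moves in parallel becomes trivially safe once states are immutable:

    import java.util.Comparator;

    // Every parallel task reads the same immutable state and builds its own
    // successor, so no synchronization is needed anywhere.
    static Move bestMove(GameState state) {
        return state.legalMoves().parallelStream()
                    .max(Comparator.comparingInt(m -> evaluate(state.applyMove(m))))
                    .orElseThrow();
    }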
Here is an example for changing a board game (TTT) from single threaded to parallel: https://dierk.gitbooks.io/fregegoodness/content/src/docs/asciidoc/incremental_episode4.html
I've been looking for implementations of scalable linked lists that can be used safely by multiple threads in parallel, which can read, insert and remove elements.
However, I was only able to find papers and theoretical descriptions of them; no libraries are available.
First of all, is it mandatory to use intrinsics, or can I do this with higher-level atomics (e.g. OpenMP atomics)? Will that hurt scalability?
Second, implementations usually provide insert and remove functions. If I want to traverse the list from another thread at the same time, the only way is to implement that myself, right?
Thanks
What challenges in shared memory parallel programming (particularly multicore) cannot be solved or cannot be solved efficiently using a Cilk-style solution (i.e. nested data parallelism with per-core work stealing task deques)?
I think Cilk's model is nested task parallelism (which you can use to implement data parallelism). It is pretty cute, but...
Cilk doesn't support SIMD data parallelism or streaming parallelism.
Cilk doesn't appear to handle partial orders over tasks well, because it only offers nested parallelism. Try coding the following set of parallel tasks: A, B, C, D with ordering constraints A before B, A before D, C before D. (This is the canonical example of the smallest partial task order that nested task parallelism can't encode directly.) You lose some of the parallelism implementing this with nested parallelism. Parallelism being precious, you don't want to waste opportunities to be parallel.
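For comparison, the same partial order can be encoded directly as a dependency graph, here sketched with Java's CompletableFuture (runA..runD are placeholder task bodies); note that B starts as soon as A finishes, even if C is still running, which nested parallelism cannot express:

    import java.util.concurrent.CompletableFuture;

    public class PartialOrderDemo {
        public static void main(String[] args) {
            CompletableFuture<Void> a = CompletableFuture.runAsync(PartialOrderDemo::runA);
            CompletableFuture<Void> c = CompletableFuture.runAsync(PartialOrderDemo::runC);

            // B depends only on A.
            CompletableFuture<Void> b = a.thenRunAsync(PartialOrderDemo::runB);

            // D depends on both A and C.
            CompletableFuture<Void> d = a.runAfterBothAsync(c, PartialOrderDemo::runD);

            CompletableFuture.allOf(b, d).join();
        }

        // Placeholder task bodies.
        static void runA() { System.out.println("A"); }
        static void runB() { System.out.println("B"); }
        static void runC() { System.out.println("C"); }
        static void runD() { System.out.println("D"); }
    }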
It doesn't handle (AFAIK) exception handling across thread boundaries.
This is needed if you want to build a really big, complex symbolic application.
It is also useful if you want to compute speculatively.
I also don't think Cilk can handle large sets of computation grains interacting (waiting on synchronization events), because AFAIK a Cilk program can only have as many live parallel computations as there are OS-offered threads. This is due to the Cilk implementation choice of living on top of standard C/C++ compilers and their stack models, which in most workstations use the one-big-stack-per-thread model. You might get to 10 or 100, but not 10,000; this matters when handling very large graphs. [I don't know for a fact that Cilk even allows computation grains to synchronize, but I don't see any technical reason why it could not, if it gave up the large-stack model.]
A second implication is that Cilk applications can't recurse over huge data structures, because whatever stack size you choose is bounded, and there's some example for which you will run out of stack. (This isn't a design flaw of Cilk, just of its implementation.) That's too bad, because huge things are one of the places where you want parallelism.
For an alternative, see PARLANSE, which offers
an arbitrarily large number of computation grains, with work-stealing, but with heap-allocated grains and activation records. Each grain has its own context (and therefore one can implement large interacting sets of grains, because it is straightforward to save grain state when it needs to wait on an event). PARLANSE synchronization primitives include futures, semaphores, and critical function results (see below).
PARLANSE offers explicit "teams" (sets of computational grains) as an abstraction, with exceptions propagating out of functions to the top of a computational grain (Java defines this to be "undefined", which is stupid), then to the team parent, and back to all the other team children as an asynchronous abort-exception (catchable in a try clause), allowing the other children to clean up.
Because some actions (e.g., asynchronous abort-exceptions) can occur at arbitrary times, PARLANSE offers the notion of critical functions, whose results are guaranteed to be returned to the caller atomically, so a function either returns a result assuredly, or does not, and the function can clean up resources safely in an asynch abort handler.
Special "partial order" teams allows one to encode computations in which the partial order is known; I think this is more efficient than Cilk's nested parallelism if you have large sets of such.
(We use PARLANSE to implement large-scale program analysis/transformation tools. PARLANSE was invented to support this; we want parallelism because the artifacts we deal with are huge trees and graphs, representing millions of lines of code.)
(PARLANSE doesn't do streams or SIMD either, but they aren't out of scope for the language. One could arguably add streams and SIMD to C and C++, but it's likely to be pretty hard.)
This is a purely just-for-interest question; any sort of questions are welcome.
So, is it possible to create thread-safe collections without any locks? By locks I mean any thread-synchronization mechanisms, including Mutex, Semaphore, and even Interlocked, all of them. Is it possible at user level, without calling system functions? OK, maybe the implementation would not be efficient; I am interested in the theoretical possibility. If not, what is the minimum needed to do it?
EDIT: Why immutable collections don't work.
Think of a class Stack with a method Add that returns another Stack.
Now here is program:
Stack stack = new ...;   // shared between threads

ThreadedMethod()
{
    loop
    {
        // do the loop work, then publish the new stack:
        stack = stack.Add(element);   // read-modify-write, not atomic
    }
}
The expression stack = stack.Add(element) is not atomic: another thread can run between your read of stack and your write, and its new stack gets overwritten.
Thanks,
Andrey
There seem to be misconceptions, even among guru software developers, about what constitutes a lock.
One has to make a distinction between atomic operations and locks. Atomic operations like compare-and-swap perform an operation (which would otherwise require two or more instructions) as a single uninterruptible instruction. Locks are built from atomic operations; however, they can result in threads busy-waiting or sleeping until the lock is unlocked.
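For example, here is a lock-free increment built directly on compare-and-swap (a sketch using Java's AtomicInteger):

    import java.util.concurrent.atomic.AtomicInteger;

    class CasCounter {
        private final AtomicInteger value = new AtomicInteger();

        // No thread ever blocks: on contention we simply re-read and retry.
        int increment() {
            int current;
            do {
                current = value.get();
            } while (!value.compareAndSet(current, current + 1));
            return current + 1;
        }
    }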
In most cases, if you manage to implement a parallel algorithm with atomic operations without resorting to locking, you will find that it is orders of magnitude faster. This is why there is so much interest in wait-free and lock-free algorithms.
There has been a ton of research done on implementing various wait-free data structures. While the code tends to be short, it can be notoriously hard to prove that it really works, due to the subtle race conditions that arise. Debugging is also a nightmare. However, a lot of work has been done, and you can find wait-free/lock-free hashmaps, queues (Michael & Scott's lock-free queue), stacks, lists, trees; the list goes on. If you're lucky you'll also find some open-source implementations.
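As a taste of what such code looks like, here is a sketch of the classic Treiber lock-free stack, where push and pop are CAS retry loops:

    import java.util.concurrent.atomic.AtomicReference;

    class LockFreeStack<T> {
        private static final class Node<T> {
            final T value;
            final Node<T> next;
            Node(T value, Node<T> next) { this.value = value; this.next = next; }
        }

        private final AtomicReference<Node<T>> head = new AtomicReference<>();

        void push(T value) {
            Node<T> oldHead;
            Node<T> newHead;
            do {
                oldHead = head.get();
                newHead = new Node<>(value, oldHead);
            } while (!head.compareAndSet(oldHead, newHead)); // retry if another thread won
        }

        T pop() {
            Node<T> oldHead;
            do {
                oldHead = head.get();
                if (oldHead == null) {
                    return null;                             // stack is empty
                }
            } while (!head.compareAndSet(oldHead, oldHead.next));
            return oldHead.value;
        }
    }

This also shows why the plain assignment in the question's EDIT loses updates: swinging the head must be a single atomic compare-and-swap, not a read followed by a write.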
Just google 'lock-free my-data-structure' and see what you get.
For further reading on this interesting subject start from The Art of Multiprocessor Programming by Maurice Herlihy.
Yes, immutable collections! :)
Yes, it is possible to do concurrency without any support from the system. You can use Peterson's algorithm or the more general bakery algorithm to emulate a lock.
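A sketch of Peterson's algorithm for two threads in Java (the volatile fields are needed on the JVM so that the plain reads and writes become visible across threads):

    class PetersonLock {
        private volatile boolean wants0, wants1; // intent flags for threads 0 and 1
        private volatile int victim;             // which thread yields on a tie

        void lock(int id) {                      // id must be 0 or 1
            if (id == 0) {
                wants0 = true;
                victim = 0;
                while (wants1 && victim == 0) { /* spin */ }
            } else {
                wants1 = true;
                victim = 1;
                while (wants0 && victim == 1) { /* spin */ }
            }
        }

        void unlock(int id) {
            if (id == 0) wants0 = false;
            else         wants1 = false;
        }
    }

Note that it uses only ordinary reads and writes: no test-and-set, no compare-and-swap, and no system call.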
It really depends on how you define the term (as other commenters have discussed) but yes, it's possible for many data structures, at least, to be implemented in a non-blocking way (without the use of traditional mutual-exclusion locks).
I strongly recommend, if you're interested in the topic, that you read the blog of Cliff Click. Cliff is the head guru at Azul Systems, who produce hardware plus a custom JVM to run Java systems on massive and massively parallel machines (think up to around 1000 cores and hundreds of gigabytes of RAM), and obviously in those kinds of systems locking can be death (disclaimer: not an employee or customer of Azul, just an admirer of their work).
Dr Click has famously come up with a non-blocking HashTable, which is basically a complex (but quite brilliant) state machine using atomic CompareAndSwap operations.
There is a two-part blog post describing the algorithm (part one, part two) as well as a talk given at Google (slides, video); the latter in particular is a fantastic introduction. It took me a few goes to 'get' it (it's complex, let's face it!), but if you persevere (or if you're smarter than me in the first place!) you'll find it very rewarding.
I don't think so.
The problem is that at some point you will need some mutual-exclusion primitive (perhaps at the machine level), such as an atomic test-and-set operation. Otherwise, you could always devise a race condition. Once you have a test-and-set, you essentially have a lock.
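In code, the step from an atomic test-and-set to a lock is a single spin loop (a sketch in Java, where getAndSet plays the role of test-and-set):

    import java.util.concurrent.atomic.AtomicBoolean;

    class TasLock {
        private final AtomicBoolean held = new AtomicBoolean(false);

        void lock() {
            // Atomically read the old value and set true: spin until we
            // are the thread that flipped it from false to true.
            while (held.getAndSet(true)) { /* busy-wait */ }
        }

        void unlock() {
            held.set(false);
        }
    }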
That being said, on older hardware that did not have any support for this in the instruction set, you could disable interrupts and thus prevent another "process" from taking over, effectively putting the system into a serialized mode and forcing a sort of mutual exclusion for a while.
At the very least you need atomic operations. There are lock-free algorithms for single CPUs. I'm not sure about multiple CPUs.