Related
As we know, work items running on GPUs can diverge when there are conditional branches. One such mention is in Apple's OpenCL Programming Guide for Mac.
As such, some portions of an algorithm may run "single-threaded", with only one work item active. And when such a portion is especially serial and long-running, some applications move that work back to the CPU.
However, this question concerns only the GPU and assumes those portions are short-lived. Do these "single-threaded" portions also diverge (as in execute both the true and false code paths) when they have conditional branches? Or will the compute units (or processing elements, whichever term you prefer) skip those false branches?
Update
In reply to a comment, I'll remove the OpenCL tag and leave the Vulkan tag.
I included OpenCL because I wanted to know whether there is any difference at all between clEnqueueTask and clEnqueueNDRangeKernel with dim=1:x=1. The documentation says they're equivalent, but I was skeptical.
I believe Vulkan removed the special function to enqueue a single-threaded task for good reasons, and if I'm wrong, please correct me.
Do these "single-threaded" portions also diverge (as in execute both true and false code paths) when they have conditional branches?
From an API point of view it has to appear to the program that only the active branch paths were taken. As to what actually happens, I suspect you'll never know for sure. GPU hardware architectures are nearly all confidential so it's impossible to be certain.
There are really two cases here:
Cases where a branch in the program turns into a real branch instruction.
Cases where a branch in the program turns into a conditional select between two computed values.
In the case of a real branch I would expect most cases to only execute the active path because it's a horrible waste of power to do both, and GPUs are all about energy efficiency. That said, YMMV and this isn't guaranteed at all.
For simple branches the compiler might choose to use a conditional select (compute both results, and then select the right answer). In this case you will compute both results. The compiler heuristics will generally aim to choose this where computing both results is less expensive than actually having a full branch.
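To make the two cases concrete, here is a minimal sketch in plain C++ (the same idea applies inside an OpenCL or Vulkan compute kernel); the functions and the particular arithmetic are purely illustrative.

#include <cmath>

// Case 1: a real branch. With a genuine branch instruction, a lone work item
// would normally be expected to execute only the taken side.
float real_branch(bool cond, float a, float b) {
    if (cond) {
        return std::sqrt(a) + 1.0f;   // evaluated only when cond is true
    } else {
        return std::log(b) - 1.0f;    // evaluated only when cond is false
    }
}

// Case 2: a conditional select. For cheap branches the compiler may lower the
// code to this form: both values are computed, then one is selected, so both
// "paths" really are executed, even for a single work item.
float conditional_select(bool cond, float a, float b) {
    float ra = a * 2.0f + 1.0f;   // computed unconditionally
    float rb = b * 0.5f - 1.0f;   // computed unconditionally
    return cond ? ra : rb;        // select; no control flow at all
}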
I included OpenCL as I wanted to know if there's any difference at all between clEnqueueTask and clEnqueueNDRangeKernel with dim=1:x=1. The document says they're equivalent but I was skeptical.
Why would they be different? They are doing the same thing conceptually ...
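For what it's worth, here is roughly what the two calls look like side by side on the host, assuming a queue and kernel created elsewhere (hypothetical setup code not shown); the OpenCL spec describes the first as equivalent to the second.

#include <CL/cl.h>

void enqueue_single_work_item(cl_command_queue queue, cl_kernel kernel)
{
    // Option 1: the dedicated single-work-item entry point.
    cl_int err = clEnqueueTask(queue, kernel, 0, NULL, NULL);

    // Option 2: an NDRange of exactly one work item in one dimension,
    // which the spec defines as equivalent to clEnqueueTask.
    size_t global = 1;
    size_t local  = 1;
    err = clEnqueueNDRangeKernel(queue, kernel,
                                 1,        /* work_dim */
                                 NULL,     /* global_work_offset */
                                 &global,  /* global_work_size = {1} */
                                 &local,   /* local_work_size  = {1} */
                                 0, NULL, NULL);
    (void)err;
}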
I believe Vulkan removed the special function to enqueue a single-threaded task for good reasons, and if I'm wrong, please correct me.
Vulkan compute dispatch is in general a whole load simpler than OpenCL (and also perfectly adequate for most use cases), so many of the host-side functions from OpenCL have no equivalent in Vulkan. The GPU side behavior is pretty much the same. It's also worth noting that most of the holes where Vulkan shaders are missing features compared to OpenCL are being patched up with extensions - e.g. VK_KHR_shader_float16_int8 and VK_KHR_variable_pointers.
Q: Or will the compute units skip those false branches?
The ecosystem of CPU / GPU code-execution is rather complex.
The hardware layer is where the code paths (translated into machine code) actually execute. At this layer the SIMD compute units cannot and will not skip anything they are ordered to process by the hardware scheduler (the next layer up).
The hardware-specific scheduler layer (GPUs typically have just two modes: warp-mode scheduling, which packs coherent, non-diverging code paths efficiently into SIMD blocks, and a greedy scheduling mode). From this layer the SIMD compute units are handed SIMD-sized blocks of work, so the first divergence detected on the layer below interrupts execution, flags the SIMD hardware scheduler about blocks deferred for later execution, and this SIMD-specific, device-optimised scheduling is well known to grow less and less efficient with each such run-time divergence.
The layer of { OpenCL | Vulkan API }-mediated, device-specific programming decides a lot about how comfortably a human can program the wide range of target devices, all without knowing each device's internal constraints, the compiler's preferred reformulation of the problem into machine code, or device-specific tricks and scheduling. In a slightly oversimplified picture, for years users have simply stood "in front of" the mediated, asynchronous HOST-to-DEVICE scheduling queues for work units (kernels), waited until the DEVICE-to-HOST results are delivered back, and performed the necessary prior H2D / posterior D2H memory transfers, where allowed and needed.
The HOST-side "scheduling" directives for DEVICE kernel code are rather imperative and help the mediated, device-specific programming reflect user-side preferences, yet they leave the user blind to the internal decisions (assembly-level reviews are really only for hard-core, device-specific GPU-engineering aces, and hard to act on even for those willing to).
All that said, "adaptive" run-time decisions, based on run-time values, to move a particular work unit back to the HOST CPU rather than finishing it on the DEVICE GPU do not, to the best of my knowledge, take place at the bottom of this complex computing-ecosystem hierarchy (afaik it would be prohibitively expensive to try).
From what I've gathered about lock-free programming, it is incredibly hard to do right... and I agree.
Just thinking about some of the problems makes my head hurt. But what I wonder is: why aren't high-level wrappers around this stuff (e.g. a lock-free queue and similar structures) in widespread use?
For example, Boost has no lock-free library, although one was proposed as far as I know.
I mean, I guess there are a lot of applications where you can't avoid the fact that the critical section is a big part of the load. So what are the reasons? Is it...
Patents - I heard that some stuff related to lock-free programming is patented.
Performance.
Google and Microsoft have internal libraries like that, but none of them are public...
Something else?
So my question is: why are high-level abstractions that use lock-free programming deep down not very popular, while at the same time "regular" multi-threaded programming is "in"?
EDIT: Boost got a lock-free library :)
There are few people who are familiar enough with the field to implement easy-to-use lock-free libraries. Of those few, even fewer publish work for free and of those almost none do the vital additional work to make the library useable - e.g. publish full API docs, etc. They tend to just release a zip file with code in, which is almost useless. Then of course you also need to find a library which is written in the language you want to use, compiles on the platform you're using and finally, word of the library has to get out, so people know it exists.
Patents are an issue, in that they limit what can be offered. There is, for example, to my knowledge no unpatented lock-free singly-linked list. All the skip list stuff is heavily patented, too.
A hero in this field is Cliff Click, who came up with a lock-free hash, which he has more-or-less placed in the public domain.
You can find my lock-free library here:
http://www.liblfds.org
Another is Samy Bahra's Concurrency Kit:
http://www.concurrencykit.org
FYI, Microsoft's .NET Framework gained some lock-free classes in .NET 4.0, namely the container classes in the System.Collections.Concurrent namespace, which are:
ConcurrentDictionary
ConcurrentQueue
ConcurrentStack
I've looked into their implementation, and they are relatively fiddly/complex under the hood, so they represent a significant amount of design and testing effort (threading issues are, of course, notoriously difficult to test to a high standard).
You can take a look at the libcds C++ library. It is a collection of lock-free containers (stacks, queues, sets and maps) and safe memory reclamation algorithms.
IMHO, regarding C++ (I'm not advanced in other languages): the new C++ standard has only just been released, and compiler developers need time to implement its requirements. Today, no compiler supports the C++11 memory model entirely, since it requires significant changes to the compiler's optimization rules. Recently, Microsoft announced support in the VC++ 11 Developer Preview for the atomic operations that are the basis of lock-free programming (a small sketch of these atomics follows at the end of this answer). That is good news for us. As far as I know, GCC is going to support it in 4.8 (or above).
The second problem is patents. Many interesting lock-free container algorithms are patented, which is a barrier to including them in vendors' libraries.
Third, a major part of lock-free containers is garbage collection (safe memory reclamation). C++ is free of any GC (fortunately). There are a few reclamation algorithms (Hazard Pointers, Pass-the-Buck, epoch-based, and so on), but most of them are patented too.
Fourth, there are not enough tools to prove the correctness of the memory fences applied in your lock-free implementation. At present I know of only one: Relacy (http://www.1024cores.net/home/relacy-race-detector).
I think that in two or three years we'll see many production-ready, multi-platform C++ libraries of lock-free containers and algorithms. These libraries are being developed by vendors and enthusiasts.
However, in my opinion, our future is hardware transactional memory (HTM). Today AMD, Sun (sorry, Oracle), and Intel (?) are investigating HTM with very interesting results. Let's wait.
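Coming back to the first point: here is a minimal sketch of what the C++11 memory model gives you once std::atomic is supported, a release/acquire pair that publishes data without any lock. Nothing here is specific to any vendor's library, and the values are purely illustrative.

#include <atomic>

int payload = 0;                      // ordinary data, no lock around it
std::atomic<bool> ready{false};

// Producer: write the data, then publish it with a release store.
void producer() {
    payload = 42;
    ready.store(true, std::memory_order_release);
}

// Consumer: spin on an acquire load; once `ready` is seen as true,
// the write to `payload` is guaranteed to be visible.
int consumer() {
    while (!ready.load(std::memory_order_acquire)) { /* spin */ }
    return payload;
}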
There is at least one "lock free" framework that is somewhat popular: Erlang.
One major problem is that unless one uses an excessive number of memory barriers, it's hard to be certain that one has enough; if one does use an excessive number of memory barriers, performance is likely to be inferior to what one would have gotten using locks.
The biggest problem with locks is not performance, but robustness. If a thread gets waylaid while it holds a lock, the system dies. By contrast, if a thread which is accessing a lock-free data structure gets waylaid, it won't affect other threads' use thereof. In some situations, a lock-free data structure may be preferable to one using locks, even if performance is inferior, because one must protect the system from being brought down by a malfunctioning thread. (For example, even if one were prepared to kill off a thread which hit a StackOverflowException without taking down the process, how would one protect against a thread putting a lot of stuff on its stack before calling a method that accesses a lock-protected data structure, such that the lock-guarded method itself hit a stack overflow?) If one uses lock-free data structures, such risks aren't a problem.
This is a purely just-for-interest question; any sort of answer is welcome.
So, is it possible to create thread-safe collections without any locks? By locks I mean any thread-synchronization mechanism, including Mutex, Semaphore, and even Interlocked, all of them. Is it possible at user level, without calling system functions? OK, maybe the implementation would not be efficient; I am interested in the theoretical possibility. If not, what is the minimum needed to do it?
EDIT: Why immutable collections don't work.
Think of a class Stack with a method Add that returns another Stack.
Now here is the program:
Stack stack = new ...;

ThreadedMethod()
{
    loop
    {
        //Do the loop
        stack = stack.Add(element);
    }
}
This expression, stack = stack.Add(element), is not atomic, and another thread can overwrite the new stack in between.
Thanks,
Andrey
There seem to be misconceptions, even among guru software developers, about what constitutes a lock.
One has to make a distinction between atomic operations and locks. Atomic operations like compare-and-swap perform an operation (which would otherwise require two or more instructions) as a single uninterruptible instruction. Locks are built from atomic operations; however, they can result in threads busy-waiting or sleeping until the lock is unlocked.
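To illustrate the distinction in code (a C++ sketch, not tied to any particular library): the same kind of atomic primitive can be used directly, lock-free, or wrapped into a lock that makes other threads wait.

#include <atomic>

// Lock-free use of an atomic operation: every thread makes progress,
// nobody ever waits for anybody else.
std::atomic<long> counter{0};
void hit() { counter.fetch_add(1, std::memory_order_relaxed); }

// A lock *built from* an atomic operation (test-and-set): correct, but a
// thread that loses the race busy-waits until the holder releases the lock.
std::atomic_flag flag = ATOMIC_FLAG_INIT;
void locked_hit(long& plain_counter) {
    while (flag.test_and_set(std::memory_order_acquire)) { /* spin */ }
    ++plain_counter;                        // critical section
    flag.clear(std::memory_order_release);  // unlock
}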
In most cases, if you manage to implement a parallel algorithm with atomic operations without resorting to locking, you will find that it is orders of magnitude faster. This is why there is so much interest in wait-free and lock-free algorithms.
There has been a ton of research done on implementing various wait-free data structures. While the code tends to be short, it can be notoriously hard to prove that the structures really work, due to the subtle race conditions that arise. Debugging is also a nightmare. However, a lot of work has been done, and you can find wait-free/lock-free hashmaps, queues (Michael Scott's lock-free queue), stacks, lists, trees; the list goes on. If you're lucky you'll also find some open-source implementations.
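As a concrete sketch of the retry loop most of these structures are built on (and of how a race like the asker's stack = stack.Add(element) gets closed), here is a Treiber-style lock-free push in C++. The node layout is purely illustrative, and memory reclamation, the hard part mentioned elsewhere on this page, is deliberately ignored.

#include <atomic>

struct Node {
    int   value;
    Node* next;
};

std::atomic<Node*> head{nullptr};

// Lock-free push: build the new node, then retry a compare-and-swap until we
// manage to swing `head` from the snapshot the node was based on to the new
// node. If another thread got in first, the CAS fails and we retry with the
// fresh head instead of silently overwriting someone else's update.
void push(int value) {
    Node* n = new Node{value, head.load(std::memory_order_relaxed)};
    while (!head.compare_exchange_weak(n->next, n,
                                       std::memory_order_release,
                                       std::memory_order_relaxed)) {
        // n->next has been reloaded with the current head; just retry.
    }
}

Pop is similar, but it is exactly where hazard pointers or epoch-based reclamation become necessary.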
Just google 'lock-free my-data-structure' and see what you get.
For further reading on this interesting subject start from The Art of Multiprocessor Programming by Maurice Herlihy.
Yes, immutable collections! :)
Yes, it is possible to do concurrency without any support from the system. You can use Peterson's algorithm or the more general bakery algorithm to emulate a lock.
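A sketch of Peterson's algorithm for two threads in C++: it emulates a lock using only plain loads and stores, though on modern hardware those loads and stores must not be reordered, which is why this sketch spells that out with sequentially consistent std::atomic operations.

#include <atomic>

std::atomic<bool> want[2] = {{false}, {false}};
std::atomic<int>  turn{0};

// Peterson's mutual exclusion for exactly two threads (me is 0 or 1).
// Built only from ordinary loads and stores, no compare-and-swap or
// test-and-set, but the accesses must stay in program order.
void lock(int me) {
    int other = 1 - me;
    want[me].store(true);   // I want to enter
    turn.store(other);      // but you go first if we collide
    while (want[other].load() && turn.load() == other) {
        // busy-wait
    }
}

void unlock(int me) {
    want[me].store(false);
}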
It really depends on how you define the term (as other commenters have discussed) but yes, it's possible for many data structures, at least, to be implemented in a non-blocking way (without the use of traditional mutual-exclusion locks).
I strongly recommend, if you're interested in the topic, that you read the blog of Cliff Click -- Cliff is the head guru at Azul Systems, who produce hardware plus a custom JVM to run Java systems on massive, massively parallel machines (think up to around 1000 cores and hundreds of gigabytes of RAM), and obviously in those kinds of systems locking can be death (disclaimer: not an employee or customer of Azul, just an admirer of their work).
Dr Click has famously come up with a non-blocking HashTable, which is basically a complex (but quite brilliant) state machine using atomic CompareAndSwap operations.
There is a two-part blog post describing the algorithm (part one, part two) as well as a talk given at Google (slides, video) -- the latter in particular is a fantastic introduction. It took me a few goes to 'get' it -- it's complex, let's face it! -- but if you persevere (or if you're smarter than me in the first place!) you'll find it very rewarding.
I don't think so.
The problem is that at some point you will need some mutual exclusion primitive (perhaps at the machine level) such as an atomic test-and-set operation. Otherwise, you could always devise a race condition. Once you have a test-and-set, you essentially have a lock.
That being said, on older hardware that did not have any support for this in the instruction set, you could disable interrupts and thus prevent another "process" from taking over, but that effectively puts the system into a serialized mode for a while, forcing a sort of mutual exclusion.
At the very least you need atomic operations. There are lock-free algorithms for single CPUs; I'm not sure about multiple CPUs.
I came across the following statement on Trapexit, an Erlang community website:
Erlang is a programming language used to build massively scalable soft real-time systems with requirements on high availability.
Also, I recall reading somewhere that Twitter switched from Ruby to Scala to address a scalability problem.
Hence, I wonder: what is the relation between a programming language and scalability?
I would think that scalability depends only on the system design, exception handling etc. Is it because of the way a language is implemented, the libraries, or some other reasons?
Hope for enlightenment. Thanks.
Erlang is highly optimized for a telecommunications environment, running at 5 9s uptime or so.
It contains a set of libraries called OTP, and it is possible to reload code into the application 'on the fly' without shutting down the application! In addition, there is a framework of supervisor modules and so on, so that when something fails, it gets automatically restarted, or else the failure can gradually work itself up the chain until it gets to a supervisor module that can deal with it.
That would of course be possible in other languages too. In C++, you can reload DLLs on the fly and load plugins. In Python you can reload modules. In C#, you can load code on the fly, use reflection, and so on.
It's just that that functionality is built in to Erlang, which means that:
it's more standard: any Erlang developer knows how it works
less stuff to re-implement oneself
That said, there are some fundamental differences between languages, to the extent that some are interpreted, some run off bytecode, and some are compiled to native code, so the performance, and the availability of type information and so on at runtime, differ.
Python has a global interpreter lock around its runtime library, so it cannot make use of SMP.
Erlang only recently had changes added to take advantage of SMP.
Generally I would agree with you in that I feel that a significant difference is down to the built-in libraries rather than a fundamental difference between the languages themselves.
Ultimately I feel that any project that gets very large risks getting 'bogged down' no matter what language it is written in. As you say, I feel architecture and design are pretty fundamental to scalability, and choosing one language over another will not, I feel, magically give awesome scalability...
Erlang comes from another culture of thinking about reliability and how to achieve it. Understanding that culture is important, since Erlang code does not become fault-tolerant by magic just because it's Erlang.
A fundamental idea is that high uptime does not only come from a very long mean time between failures; it also comes from a very short mean time to recovery when a failure does happen.
One then realizes that one needs automatic restarts when a failure is detected. And one realizes that at the first detection of something not being quite right, one should "crash" to cause a restart. Recovery needs to be optimized, and the possible information loss needs to be minimal.
This strategy is followed by much successful software, such as journaling filesystems or transaction-logging databases. But overwhelmingly, software tends to consider only the mean time between failures, send error indications to the system log, and then try to keep running until that is no longer possible, typically requiring a human to monitor the system and reboot it manually.
Most of these strategies come in the form of libraries in Erlang. The part that is a language feature is that processes can "link" and "monitor" each other. The first is a bi-directional contract that says "if you crash, then I get your crash message, which, if not trapped, will crash me", and the second is "if you crash, I get a message about it".
Linking and monitoring are the mechanisms that the libraries use to make sure that other processes have not crashed (yet). Processes are organized into "supervision" trees. If a worker process in the tree fails, the supervisor will attempt to restart it, or all workers at the same level of that branch in the tree. If that fails it will escalate up, etc. If the top level supervisor gives up the application crashes and the virtual machine quits, at which point the system operator should make the computer restart.
The complete isolation between process heaps is another reason Erlang fares well. With few exceptions, it is not possible to "share values" between processes. This means that all processes are very self-contained and are often not affected by another process crashing. This property also holds between nodes in an Erlang cluster, so it is low-risk to handle a node failing out of the cluster. Replicate and send out change events rather than have a single point of failure.
The philosophies adopted by Erlang have many names: "fail fast", "crash-only system", "recovery-oriented programming", "expose errors", "micro-restarts", "replication", ...
Erlang is a language designed with concurrency in mind. While most languages depend on the OS for multi-threading, concurrency is built into Erlang. Erlang programs can be made from thousands to millions of extremely lightweight processes that can run on a single processor, can run on a multicore processor, or can run on a network of processors. Erlang also has language level support for message passing between processes, fault-tolerance etc. The core of Erlang is a functional language and functional programming is the best paradigm for building concurrent systems.
In short, making a distributed, reliable and scalable system in Erlang is easy as it is a language designed specially for that purpose.
In short, the "language" primarily affects the vertical axii of scaling but not all aspects as you already eluded to in your question. Two things here:
1) Scalability needs to be defined in relation to a tangible metric. I propose money.
S = # of users / cost
Without an adequate definition, we would be discussing this point ad vitam aeternam. Using my proposed definition, it becomes easier to compare system implementations. For a system to be scalable (read: profitable):
Scalability grows with S
2) A system can be made to scale along 2 primary axes:
a) Vertical
b) Horizontal
a) Vertical scaling relates to enhancing nodes in isolation i.e. bigger server, more RAM etc.
b) Horizontal scaling relates to enhancing a system by adding nodes. This process is more involved, since it requires dealing with real-world properties such as the speed of light (latency), tolerance to partitions, failures of many kinds, etc.
(Node => physical separation, different "fate sharing" from another)
The term scalability is too often abused unfortunately.
Too many times, folks confuse the language with its libraries and implementation. These are all different things. What makes a language a good fit for a particular system often has more to do with the support around said language: libraries, development tools, efficiency of the implementation (e.g. memory footprint, performance of built-in functions, etc.).
In the case of Erlang, it just happens to have been designed with real world constraints (e.g. distributed environment, failures, need for availability to meet liquidated damages exposure etc.) as input requirements.
Anyways, I could go on for too long here.
First you have to distinguish between languages and their implementations. For instance, the Ruby language supports threads, but in the official implementation, threads will not make use of multicore chips.
Then, a language/implementation/algorithm is often termed scalable when it supports parallel computation (for instance via multithreading) AND when it exhibits a good speedup as the number of CPUs goes up (see Amdahl's Law).
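For reference, Amdahl's Law bounds that speedup: if P is the fraction of the work that can be parallelized and N is the number of CPUs, then

Speedup(N) = 1 / ((1 - P) + P / N)

so even with P = 0.9, the speedup can never exceed 10x no matter how many cores you add.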
Some languages, like Erlang, Scala, Oz, etc., also have syntax (or nice libraries) which helps in writing clear and nice parallel code.
In addition to the points made here about Erlang (which I was not aware of), there is a sense in which some languages are more suited to scripting and smaller tasks.
Languages like Ruby and Python have some features which are great for prototyping and creativity but terrible for large-scale projects. Arguably their best feature is their lack of "formality", which is exactly what hurts you in large projects.
For example, static typing is a hassle on small script-type things, and makes languages like Java very verbose. But on a project with hundreds or thousands of classes you can easily see variable types. Compare this to maps and arrays that can hold heterogeneous collections, where as a consumer of a class you can't easily tell what kind of data it's holding. This kind of thing gets compounded as systems get larger. For example, you can also do things that are really difficult to trace, like dynamically adding bits to classes at runtime (which can be fun but is a nightmare if you're trying to figure out where a piece of data comes from) or calling methods that raise exceptions without being forced by the compiler to declare them. Not that you couldn't solve these kinds of things with good design and disciplined programming - it's just harder to do.
As an extreme case, you could (performance issues aside) build a large system out of shell scripts, and you could probably deal with some of the issues of messiness, lack of typing and global variables by being very strict and careful with coding and naming conventions (in which case you'd sort of be creating a static typing system "by convention"), but it wouldn't be a fun exercise.
Twitter switched some parts of their architecture from Ruby to Scala because when they started they used the wrong tool for the job. They were using Ruby on Rails—which is highly optimised for building green field CRUD Web applications—to try to build a messaging system. AFAIK, they're still using Rails for the CRUD parts of Twitter e.g. creating a new user account, but have moved the messaging components to more suitable technologies.
Erlang is at its core based on asynchronous communication (both for co-located and distributed interactions), and that is the key to the scalability made possible by the platform. You can program with asynchronous communication on many platforms, but Erlang the language and the Erlang/OTP framework provide the structure to make it manageable - both technically and in your head. For instance: without the isolation provided by Erlang processes, you will shoot yourself in the foot. With the link/monitor mechanism you can react to failures sooner.
As a software developer dealing mostly with high-level programming languages, I'm not sure what I can do to pay appropriate attention to the coming omnipresence of multicore computers. I write mostly ordinary, non-demanding applications; nevertheless, I think it is important to know whether I need to change any programming paradigms or even languages to master the future.
My question therefore:
How to deal with increasing multicore presence in day-by-day hacking?
Herb Sutter wrote about it in 2005: The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software
Most problems do not require a lot of CPU time. Really, single cores are quite fast enough for many purposes. When you do find your program is too slow, first profile it and look at your choice of algorithms, architecture, and caching. If that doesn't get you enough, try to divide the problem up into separate processes. Often this is worth doing simply for fault isolation and so that you can understand the CPU and memory usage of each process. Also, normally each process will run on a specific core and make good use of the processor caches, so you won't have to suffer the substantial performance overhead of keeping cache lines consistent. If you go for a multi-process design and still find the problem needs more CPU time than you can get from the machine you have, you are well placed to extend it to run over a cluster.
There are situations where you need multiple threads within the same address space, but beware that threads are really hard to get right. Race conditions, especially in non-safe languages, sometimes take weeks to debug; often, simply adding tracing or running under a debugger will change the timings enough to hide the problem. Simply putting locks everywhere often means you get a lot of locking overhead and sometimes so much lock contention that you don't really get the concurrency advantage you were hoping for. Even when you've got the locking right, you then need to profile to tune for cache coherency. Ultimately, if you want to really tune some highly concurrent code, you'll probably end up looking at lock-free constructs and more complex locking schemes than those in current multi-threading libraries.
Learn the benefits of concurrency, and the limits (e.g. Amdahl's law).
So you can, where possible, exploit the only route for higher performance that is going to be open. There is a lot of innovative work happening on easier approaches (futures and task libraries), and old work being rediscovered (functional languages and immutable data).
The free lunch is over, but that does not mean that there is nothing to exploit.
In general, become very friendly with threading. It's a terrible mechanism for parallelization, but it's what we have.
If you do work with .NET, look at the Parallel Extensions. They allow you to easily accomplish many parallel programming tasks.
To benefit from more than just one core you should consider parallelizing your code. Multiple threads, immutable types, and a minimum of synchronization are your new friends.
I think it will depend on what kind of applications you're writing.
Some kinds of apps benefit more than others from running on a multi-core CPU.
If your application can benefit from the multi-core fact, then you should be ready to go parallel.
The free lunch is over; that is: in the past, your application became faster when a new CPU was released, and you didn't have to put any effort into your application to get that extra speed.
Now, to take advantage of the capabilities a multi-core CPU offers, you have to make sure that your application can take advantage of them. That is: you have to see which tasks can be executed multithreaded / concurrently, and this brings some issues to the table...
Learn Erlang/F# (depending on your platform)
Prefer immutable data structures, their use makes software easier to understand not only in concurrent programs.
Learn the tools for concurrency in your language (e.g. java.util.concurrent, JCIP).
Learn a functional language (e.g Haskell).
I've been asked the same question, and the answer is "it depends". If you're Joe WinForms, maybe not so much. If you're writing code that must be performant, yes. One of the biggest problems I can see with parallel programming is this: if something can't be parallelized, and you lie and tell the runtime to do it in parallel anyway, it's not going to crash, it's just going to do things wrong, and you'll get crap results and blame the framework.
Learn OpenMP and MPI for C and C++ code.
OpenMP applies to other languages as well, like Fortran, I suppose.
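A minimal OpenMP sketch in C++ (compile with -fopenmp on GCC or Clang, for example); the vector sum is just an illustrative workload.

#include <cstdio>
#include <vector>

int main() {
    std::vector<double> v(1000000, 1.0);
    double sum = 0.0;

    // Split the loop iterations across the available cores; the reduction
    // clause gives each thread a private partial sum and combines them.
    #pragma omp parallel for reduction(+ : sum)
    for (long i = 0; i < static_cast<long>(v.size()); ++i) {
        sum += v[i];
    }

    std::printf("sum = %f\n", sum);
    return 0;
}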
Write smaller programs.
Other code languages/styles will let you do multithreading better (though multithreading is still really hard in any language) but the big benefit for regular developers, IMHO, is the ability to execute lots of smaller programs concurrently to accomplish some much larger task.
So, get in the habit of breaking your problems down into independent components that can be run whenever you want.
You'll build more maintainable software too.