Are there any concurrent algorithms in use that work correctly without any synchronization? - algorithm

All of the concurrent programs I've seen or heard details of (admittedly a small set) at some point use hardware synchronization features, generally some form of compare-and-swap. The question is: are there any concurrent programs in the wild where the threads interact throughout their lives and get away without any synchronization?
Examples of what I'm thinking of include:
A program that amounts to a single thread running a yes/no test on a large set of cases and a big pile of threads tagging cases based on maybe/no tests. This doesn't need synchronization because dirty data will only affect performance rather than correctness.
A program that has many threads updating a data structure where any state that is valid now will always be valid, so dirty reads or writes don't invalidate anything. An example of this is (I think) path compression in the union-find algorithm.
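As a rough sketch of what I mean (my own illustration in C++, using relaxed atomics so the races are well-defined; it assumes the unions are done up front or synchronized separately), racy path compression can only re-point a node at one of its ancestors, so any interleaving leaves the forest valid:
#include <atomic>
#include <cstddef>
#include <vector>

// Union-find where find() does racy path compression. Writes may be lost or
// applied out of order, but every write only replaces a parent pointer with
// an ancestor of that node, so races cost extra hops, never correctness.
class UnionFind {
public:
    explicit UnionFind(std::size_t n) : parent_(n) {
        for (std::size_t i = 0; i < n; ++i)
            parent_[i].store(i, std::memory_order_relaxed);
    }

    std::size_t find(std::size_t x) {
        std::size_t root = x;
        while (true) {
            std::size_t p = parent_[root].load(std::memory_order_relaxed);
            if (p == root) break;
            root = p;
        }
        // Path compression: point every node on the path at the root.
        while (x != root) {
            std::size_t p = parent_[x].load(std::memory_order_relaxed);
            parent_[x].store(root, std::memory_order_relaxed);
            x = p;
        }
        return root;
    }

private:
    // Concurrent union() operations would need CAS or other synchronization;
    // this sketch only covers find() with compression.
    std::vector<std::atomic<std::size_t>> parent_;
};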

If you can break work up into completely independent chunks, then yes, there are concurrent algorithms whose only synchronisation point is the one at the end of the work where all threads join. Parallel speedup is then a matter of being able to break the work into tasks whose sizes are as similar as possible.

Some indirect methods for solving systems of linear equations, like Successive over-relaxation ( http://en.wikipedia.org/wiki/Successive_over-relaxation ), don't really need the iterations to be synchronized.
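For illustration, here is a rough C++ sketch of such an unsynchronized ("chaotic") relaxation (my own example, not taken from the linked page; whether it converges depends on the usual conditions for asynchronous iterations):
#include <atomic>
#include <cstddef>
#include <cstdio>
#include <thread>
#include <vector>

// Unsynchronized relaxation for the 1-D Laplace equation
// u[i] = (u[i-1] + u[i+1]) / 2 with fixed boundary values. Each thread
// sweeps its own interior range with no barriers; reads of a neighbour
// owned by the other thread may be stale, which can slow convergence
// but does not change the fixed point being approached.
int main() {
    const std::size_t n = 1000;   // grid points
    const int sweeps = 20000;     // sweeps per thread (illustrative)

    std::vector<std::atomic<double>> u(n);
    for (auto& x : u) x.store(0.0, std::memory_order_relaxed);
    u[0].store(1.0, std::memory_order_relaxed);      // boundary value
    u[n - 1].store(0.0, std::memory_order_relaxed);  // boundary value

    auto worker = [&](std::size_t lo, std::size_t hi) {
        for (int s = 0; s < sweeps; ++s)
            for (std::size_t i = lo; i < hi; ++i) {
                double left  = u[i - 1].load(std::memory_order_relaxed);
                double right = u[i + 1].load(std::memory_order_relaxed);
                u[i].store(0.5 * (left + right), std::memory_order_relaxed);
            }
    };

    std::thread t1(worker, 1, n / 2);      // owns indices [1, n/2)
    std::thread t2(worker, n / 2, n - 1);  // owns indices [n/2, n-1)
    t1.join();
    t2.join();  // the only synchronisation is this final join

    std::printf("u[n/2] ~ %f\n", u[n / 2].load(std::memory_order_relaxed));
}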

I think it's a bit of a trick question because, e.g., if you program in C, malloc() must be thread-safe and uses hardware synchronization, and in Java the garbage collector requires hardware synchronization anyway. All Java programs require the GC, and hardly any C program makes it without malloc() (or C++ program without the new operator).

There is a whole class of algorithms which are sometimes referred to as "embarallel" (contraction of "embarrassingly parallel"). Many image processing algorithms fall into this class, where each pixel may be processed independently (which makes implementation with e.g. SIMD or GPGPU very straightforward).
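For instance, a minimal sketch of the per-pixel case (my own example): each thread owns a disjoint band of rows, so nothing is shared and the only synchronisation is the final join:
#include <algorithm>
#include <cstdint>
#include <thread>
#include <vector>

// Embarrassingly parallel image processing: invert an 8-bit greyscale image.
// Each thread owns a disjoint band of rows, so no pixel is ever touched by
// two threads and no locks or atomics are needed.
void invert(std::vector<std::uint8_t>& pixels, int width, int height, int nthreads) {
    std::vector<std::thread> workers;
    const int rows_per_thread = (height + nthreads - 1) / nthreads;
    for (int t = 0; t < nthreads; ++t) {
        const int row_begin = t * rows_per_thread;
        const int row_end = std::min(height, row_begin + rows_per_thread);
        workers.emplace_back([&pixels, width, row_begin, row_end] {
            for (int y = row_begin; y < row_end; ++y)
                for (int x = 0; x < width; ++x)
                    pixels[y * width + x] = 255 - pixels[y * width + x];
        });
    }
    for (auto& w : workers) w.join();  // the only synchronisation point
}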

Well, without any synchronization at all (even at the end of the algorithm) you obviously can't do anything useful, because you can't even transfer the results of the concurrent computations to the main thread: suppose they were computed on remote machines with no communication channels to the main machine.

The simplest example is inside java.lang.String, which is immutable and lazily caches its hash code. This cache is written without synchronization because (a) it's cheaper, (b) the value is recomputable, and (c) the JVM guarantees that the int write won't tear. The tolerance of data races in purely functional contexts allows tricks like this to be used safely without explicit synchronization.
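Roughly the same idiom can be sketched in C++ (my own example, using a relaxed atomic so the race is well-defined rather than relying on JVM guarantees); every caller may recompute the hash, but they all compute the same value, so whichever write lands is correct:
#include <atomic>
#include <cstddef>
#include <functional>
#include <string>

// Lazily cached hash of an immutable string, in the style of
// java.lang.String.hashCode(). Concurrent callers may race to fill the
// cache; at worst the hash is computed more than once, never incorrectly.
class CachedString {
public:
    explicit CachedString(std::string s) : value_(std::move(s)) {}

    std::size_t hash() const {
        std::size_t h = cached_hash_.load(std::memory_order_relaxed);
        if (h == 0) {  // 0 doubles as "not yet computed" (as in Java)
            h = std::hash<std::string>{}(value_);
            if (h == 0) h = 1;  // extra tweak: avoid recomputing forever for hash 0
            cached_hash_.store(h, std::memory_order_relaxed);
        }
        return h;
    }

private:
    std::string value_;                                // never mutated
    mutable std::atomic<std::size_t> cached_hash_{0};  // racy but benign
};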

I agree with Mitch's answer. I would like to add that the ray tracing algorithm can work without synchronization until the point where all threads join.

Related

Spinlock implementation reasoning

I want to improve the performance of a program by replacing some of the mutexes
with spinlocks. I have found a spinlock implementation in
http://www.boost.org/doc/libs/1_36_0/boost/detail/spinlock_sync.hpp
which I intend to reuse. I believe this implementation is safer than simpler implementations in which threads keep trying forever, like the one found here
http://www.boost.org/doc/libs/1_54_0/doc/html/atomic/usage_examples.html#boost_atomic.usage_examples.example_spinlock.implementation
But I need to clarify some things about the yield function found here
http://www.boost.org/doc/libs/1_36_0/boost/detail/yield_k.hpp
First of all, I assume that the numbers 4, 16, 32 are arbitrary; I actually tested some other values and found that I got the best performance in my case by using other values.
But can someone explain the reasoning behind the yield code? Specifically, why do we need all three of
BOOST_SMT_PAUSE
sched_yield and
nanosleep
Yes, this concept is known as "adaptive spinlock" - see e.g. https://lwn.net/Articles/271817/.
Usually the numbers are chosen for exponential back-off: https://geidav.wordpress.com/tag/exponential-back-off/
So, the numbers aren't arbitrary. However, which "numbers" work for your case depends on your application patterns, requirements and system resources.
The three methods to introduce "micro-delays" are designed explicitly to balance the cost and the potential gain:
a pure busy-spin costs nothing in latency, but it results in high power consumption and wasted cycles
a small "cheap" delay (such as a pause instruction) might avoid the cost of a context switch while reducing the CPU load relative to a busy-spin
a simple yield might allow the OS to avoid a context switch, depending on the rest of the system load (e.g. if the number of runnable threads < the number of logical cores)
The trade-offs between these are important for low-latency applications, where the effect of a context switch or of cache misses is significant.
TL;DR
All trade-offs try to find a balance between wasting CPU cycles and losing cache/thread efficiency.
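For illustration, a minimal sketch of the tiered back-off (my own code, loosely modelled on the Boost yield_k header and assuming a POSIX system with an x86 pause instruction; the 4/16/32 thresholds are the ones discussed above and should be tuned):
#include <sched.h>     // sched_yield (POSIX)
#include <time.h>      // nanosleep (POSIX)
#include <immintrin.h> // _mm_pause (x86)

// Escalating "micro-delay" for the k-th failed attempt to acquire a spinlock.
// Cheapest measures first (pure spin, then the pause hint), then yield the
// time slice, and finally sleep so a long wait stops burning CPU entirely.
inline void yield_k(unsigned k) {
    if (k < 4) {
        // tier 0: pure busy spin; lowest latency, highest CPU/power cost
    } else if (k < 16) {
        _mm_pause();              // tier 1: pause hint, relieves the SMT sibling
    } else if (k < 32 || (k & 1)) {
        sched_yield();            // tier 2: let the scheduler run another thread
    } else {
        timespec ts{0, 1000000};  // tier 3: sleep ~1 ms and give up the CPU
        nanosleep(&ts, nullptr);
    }
}

// Typical use inside a spin loop (lock assumed to be a std::atomic_flag):
//   for (unsigned k = 0; lock.test_and_set(std::memory_order_acquire); ++k)
//       yield_k(k);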

Can parallelization have a negative performance impact?

With the abundance of techniques being employed to increase parallelization in today's compiler tools (especially auto-parallelization of certain viable for-constructs, cf. the Intel C++ Compiler, Microsoft Visual Studio 2011, alongside various others), I wondered whether parallelization is always guaranteed to improve performance, or at least to have no impact on it.
Are there any cases in which parallelization would have a distinctly negative impact on performance?
A quick internet search didn't yield much hope, so I decided to turn here to see if anyone has any knowledge of cases where parallelization has a detrimental impact on performance, or better yet, experience in a project where parallelization actually caused difficulties.
I am also curious about whether there are any negative performance implications of auto-vectorization, although I find it quite unlikely that there would be.
Thanks in advance!
Parallelisation usually involves some data exchange between the different processing elements, since not every processing element has exclusive access to all the data it needs in order to complete its part of the computation. It could either be messages passed between different processes in an MPI job or it could be synchronisation actions in a multithreaded program. Passing data around or synchronising things takes time and that's why it is usually called communication or synchronisation overhead. There are different classes of problems depending on the ratio between overhead and computation.
Parallel algorithms that require no communication or synchronisation at all are called trivially (or "embarrassingly") parallel problems. An example of this class is a ray-tracing application: each pixel can be computed independently of all the others. Problems in this class scale linearly with the number of processing elements used (and sometimes even superlinearly because of caching effects) - give it twice as many processing elements and it will finish in half the time.
If any amount of communication or synchronisation is involved then things get progressively worse as the ratio between communication/synchronisation and computation increases. Usually this is the case when the problem size is kept fixed as one increases the number of processing elements. Usually the overhead increases with the number of processing elements while the amount of computation per element decreases.
Auto-vectorization can theoretically fall into "traps" where the overhead of getting all the elements in the right places is actually bigger than the time saved by doing things in parallel. Analyzing how much time a piece of code will take is hard, so it's hard for compilers to make the right decision.
Towards the end of these slides are some examples and statistics about auto-vectorization making the performance worse.
Usually, with reasonable usage, parallelization (meaning parallel processing) gives a positive performance impact.
But in some cases, from a developer's point of view, it can have negative effects:
Allocating too many threads for parallel and/or multithreaded processing.
Fork/join parallelism and loop parallelization when the work per iteration is too small, so allocating threads costs more time and resources than simply processing the items synchronously.
Typical multithreading/parallel execution problems like deadlocks, livelocks, thread starvation, race conditions, etc.
Debugging and diagnostics: it's harder to find bugs.
So it all should be used reasonably.
And some links. Sorry, they are .NET/Microsoft-specific, but the problems described there are the same:
Potential Pitfalls in Data and Task Parallelism
Potential Pitfalls with Parallel LINQ (PLINQ)
Good book where common problems and pitfalls are described:
Patterns for Parallel Programming: Understanding and Applying Parallel Patterns with the .NET Framework 4
From a more theoretical point of view, you may be interested in problems that are not in NC, i.e. the class of decision problems decidable in polylogarithmic time on a parallel computer with a polynomial number of processors.
Off the top of my head, I cannot think of any computational problem that is not, in some way or another, parallelizable. What I have encountered many times though are problems that have been badly parallelized.
Badly parallelized programs can easily be slower than their sequential versions. This can be a result of:
Massive overheads due to the parallelism being too fine-grained, e.g. the amount of work performed per thread is negligible compared to the overhead of starting/scheduling the operation. In OpenMP, this could be the case of a #pragma omp parallel for schedule(dynamic,k) for a small chunk size k (see the sketch at the end of this answer).
Repeated concurrent access to shared resources, e.g. if all threads have to wait to access some resource or memory location sequentially. In OpenMP, this can be caused by too many or too large #pragma omp critical sections.
Over-use of slow atomic operations to update variables shared between threads, e.g. using #pragma omp atomic where, in the sequential case, faster regular memory access would be used.
In summary, and in my opinion, there are few inherently sequential problems, but mountains of badly-implemented parallel solutions.
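To make the first point concrete, here is a small OpenMP sketch (my own example with made-up sizes): with schedule(dynamic,1) each thread asks the runtime for one trivial iteration at a time, so scheduling overhead can easily dwarf the arithmetic, while coarser chunks or a static schedule amortise it:
#include <cstdio>
#include <vector>
#include <omp.h>

int main() {
    const int n = 10000000;
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n);

    double t0 = omp_get_wtime();
    // Too fine-grained: each thread asks the runtime for one tiny iteration
    // at a time, so most of the time goes into scheduling, not arithmetic.
    #pragma omp parallel for schedule(dynamic, 1)
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
    double t1 = omp_get_wtime();

    // Coarser chunks (or a static schedule) amortise the scheduling cost.
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
    double t2 = omp_get_wtime();

    std::printf("dynamic,1: %.3f s   static: %.3f s\n", t1 - t0, t2 - t1);
}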

Haskell: Concurrent data structure guidelines

I've been trying to get an understanding of concurrency, and I've been trying to work out what's better: one big IORef lock or many TVars. I've come to the following guidelines; comments would be appreciated regarding whether these are roughly right or whether I've missed the point.
Let's assume our concurrent data structure is a map m, accessed like m[i]. Let's also say we have two functions, f_easy and f_hard. f_easy is quick, f_hard takes a long time. We'll assume the arguments to f_easy/f_hard are elements of m.
(1) If your transactions look roughly like this m[f_easy(...)] = f_hard(...), use an IORef with atomicModifyIORef. Laziness will ensure that m is only locked for a short time as it's updated with a thunk. Calculating the index effectively locks the structure (as something is going to get updated, but we don't know what yet), but once it's known what that element is, the thunk over the entire structure moves to a thunk only over that particular element, and then only that particular element is "locked".
(2) If your transactions look roughly like this m[f_hard(...)] = f_easy(...), and they don't conflict too much, use lots of TVars. Using an IORef in this case will effectively make the app single-threaded, as you can't calculate two indexes at the same time (as there will be an unresolved thunk over the entire structure). TVars let you work out two indexes at the same time; however, the negative is that if two concurrent transactions both access the same element, and one of them is a write, one transaction must be scrapped, which wastes time (which could have been used elsewhere). If this happens a lot, you may be better off with the locks that come (via blackholing) from IORef, but if it doesn't happen very much, you'll get better parallelism with TVars.
Basically in case (2), with IORef you may get 100% efficiency (no wasted work) but only use 1.1 threads, but with TVar if you have a low number of conflicts you might get 80% efficiency but use 10 threads, so you still end up 7 times faster even with the wasted work.
Your guidelines are somewhat similar to the findings of [1] (Section 6) where the performance of the Haskell STM is analyzed:
"In particular, for programs that do not perform much work inside transactions, the commit overhead appears to be very high. To further observe this overhead, an analysis needs to be conducted on the performance of commit-time course-grain and fine-grain STM locking mechanisms."
I use atomicModifyIORef or an MVar when all the synchronization I need is something that simple locking will ensure. When looking at concurrent accesses to a data structure, it also depends on how this data structure is implemented. For example, if you store your data inside an IORef Data.Map and frequently perform read/write access, then I think atomicModifyIORef will degrade to single-threaded performance, as you have conjectured, but the same will be true for a TVar Data.Map. My point is that it's important to use a data structure that is suitable for concurrent programming (balanced trees aren't).
That said, in my opinion the winning argument for using STM is composability: you can combine multiple operations into a single transaction without headaches. In general, this isn't possible using IORef or MVar without introducing new locks.
[1] The limits of software transactional memory (STM): dissecting Haskell STM applications on a many-core environment.
http://dx.doi.org/10.1145/1366230.1366241
Answer to @Clinton's comment:
If a single IORef contains all your data, you can simply use atomicModifyIORef for composition. But if you need to process lots of parallel read/write requests to that data, the performance loss might become significant, since every pair of parallel read/write requests to that data might cause a conflict.
The approach that I would try is to use a data structure where the entries themselves are stored inside a TVar (vs putting the whole data structure into a single TVar). That should reduce the possibility of livelocks, as transactions won't conflict that often.
Of course, you still want to keep your transactions as small as possible and use composability only if it's absolutely necessary to guarantee consistency. So far I haven't encountered a scenario where combining more than a few insert/lookup operations into a single transaction was necessary.
Beyond performance, I see a more fundamental reason to use TVar--the type system ensures you don't do any "unsafe" operations like readIORef or writeIORef. That your data is shared is a property of the type, not of the implementation. EDIT: unsafePerformIO is always unsafe. readIORef is only unsafe if you are also using atomicModifyIORef. At the very least, wrap your IORef in a newtype and only expose a wrapped atomicModifyIORef.
Beyond that, don't use IORef, use MVar or TVar
The first usage pattern you describe probably does not have nice performance characteristics. You likely end up being (almost) entirely single threaded--because of laziness no actual work happens each time you update the shared state, but whenever you need to use this shared state, the entire accumulated pile of thunks needs to be forced, and has a linear data dependency structure.
Having 80% efficiency but substantially higher parallelism allows you to exploit a growing number of cores. You can expect minimal performance improvements on single-threaded code over the coming years.
Multi-word CAS is likely coming to a processor near you in the form of "Hardware Transactional Memory", allowing STMs to become far more efficient.
Your code will be more modular--when your design has all shared state behind a single reference, every piece of code has to be changed whenever you add more shared state. TVars, and to a lesser extent MVars, support natural modularity.

When should we use a scatter/gather(vectored) IO?

The Windows file system supports scatter/gather IO. (Of course, other platforms do too.)
But I don't know when to use this IO mechanism.
Could you explain a proper use case to me?
And what benefit can we get from using this I/O mechanism? (Just fewer small IO requests?)
You use Scatter/Gather IO when you are doing lots of random (i.e. non-sequential) reads / writes, and you want to save on context switches / syscalls - Scatter/Gather is a form of batching in this sense. However, unless you've got a very fast disk (or more likely, a large array of disks), the syscall cost is negligible.
If you were writing a Database server, you might care about this, but anything less than a big-iron machine handling thousands or millions of requests a second won't see any benefit.
Paul -- one extra note: one additional advantage is that you hand multiple requests to the disk driver at the same time. The driver then can sort the requests and issue them in the optimal order. While syscall time is small, seek time (many milliseconds) can be punitive (that's less than 1000 I/O's/sec).
Chris's comment about demonstrating the efficiency is pragmatic. Mother nature never lies. Well, almost never.
I would imagine that you would use scatter/gather IO when you (a) suspected your application had a performance bottleneck, and (b) built a performance analysis framework that could show a significant improvement from using it.
Unless you can show a provable improvement, the additional code complexity is just a risk, and there's no magic recipe that says that, when some condition is met, an application will automatically benefit in a significant way from some programming cleverness.
Or - to put it another way - don't base major architectural decisions on the statements of 'some guy on an internet forum'. Create a test, and find out.
In POSIX, readv and writev read from or write to discontiguous memory, but to read and write discontiguous file ranges from discontiguous memory in one go you want readx and writex, which were among the proposed POSIX additions.
Doing a readx is faster than doing a lot of reads, as it's only one system call and it gives the disk scheduler the most I/Os to reorder. I remember someone saying that the ext2/3 fsck program wanted this, as it knows in advance which ranges it wants.
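For reference, a minimal readv sketch (my own example; the file name is hypothetical) showing the part that readv/writev already cover, gathering one read into several buffers with a single system call:
#include <cstdio>
#include <fcntl.h>
#include <sys/uio.h>   // readv
#include <unistd.h>

int main() {
    int fd = open("example.dat", O_RDONLY);  // hypothetical file name
    if (fd < 0) { perror("open"); return 1; }

    char header[16];
    char body[4096];

    // One system call fills both buffers in order: first `header`, then `body`.
    iovec iov[2];
    iov[0].iov_base = header; iov[0].iov_len = sizeof header;
    iov[1].iov_base = body;   iov[1].iov_len = sizeof body;

    ssize_t n = readv(fd, iov, 2);
    if (n < 0) perror("readv");
    else std::printf("read %zd bytes across 2 buffers\n", n);

    close(fd);
}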

What to avoid for performance reasons in multithreaded code?

I'm currently reviewing/refactoring a multithreaded application which is supposed to be multithreaded in order to be able to use all the available cores and theoretically deliver better / superior performance (superior is the commercial term for better :P)
What are the things I should be aware when programming multithreaded applications?
I mean things that will greatly impact performance, maybe even to the point where you don't gain anything with multithreading at all but lose a lot by design complexity. What are the big red flags for multithreading applications?
Should I start questioning the locks and looking to a lock-free strategy or are there other points more important that should light a warning light?
Edit: The kind of answers I'd like are similar to the answer by Janusz: I want red flags to look for in code. I know the application doesn't perform as well as it should, and I need to know where to start looking, what should worry me and where I should put my efforts. I know it's kind of a general question, but I can't post the entire program, and if I could choose one section of code then I wouldn't need to ask in the first place.
I'm using Delphi 7, although the application will be ported / remade in .NET (C#) next year, so I'd rather hear comments that are applicable as general practice, or, if they must be language-specific, ones that apply to either of those languages.
One thing to definitely avoid is lots of write access to the same cache lines from multiple threads.
For example: if you use a single counter variable to count the number of items processed by all threads, this will really hurt performance because the cache line holding the counter has to be re-synchronized every time another CPU writes to the variable.
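A small sketch of the problem and the usual fix (my own example, assuming 64-byte cache lines and a C++17 compiler): give each thread its own counter padded to a full cache line, and only combine the counts at the end:
#include <array>
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

// One counter shared by all threads: every increment drags the cache line
// holding `shared` between cores, which serialises the threads at the memory
// system. Per-thread counters, padded to a full cache line, avoid that and
// are summed once at the end.
struct alignas(64) PaddedCount {   // 64 = typical cache line size (assumption)
    std::atomic<long> value{0};
};

constexpr int kThreads = 4;
constexpr long kIters = 10000000;

int main() {
    std::atomic<long> shared{0};
    std::array<PaddedCount, kThreads> local;
    std::vector<std::thread> ts;

    // Contended version: all threads hammer the same cache line.
    for (int t = 0; t < kThreads; ++t)
        ts.emplace_back([&shared] {
            for (long i = 0; i < kIters; ++i)
                shared.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto& th : ts) th.join();
    ts.clear();

    // Padded version: each thread writes only its own cache line.
    for (int t = 0; t < kThreads; ++t)
        ts.emplace_back([&local, t] {
            for (long i = 0; i < kIters; ++i)
                local[t].value.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto& th : ts) th.join();

    long total = 0;
    for (auto& c : local) total += c.value.load();
    std::printf("shared = %ld, per-thread total = %ld\n", shared.load(), total);
}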
One thing that decreases performance is having two threads with much hard drive access. The hard drive would jump from providing data for one thread to the other and both threads would wait for the disk all the time.
Something to keep in mind when locking: lock for as short a time as possible. For example, instead of this:
lock(syncObject)
{
bool value = askSomeSharedResourceForSomeValue();
if (value)
DoSomethingIfTrue();
else
DoSomethingIfFalse();
}
Do this (if possible):
bool value = false;
lock(syncObject)
{
value = askSomeSharedResourceForSomeValue();
}
if (value)
DoSomethingIfTrue();
else
DoSomethingIfFalse();
Of course, this example only works if DoSomethingIfTrue() and DoSomethingIfFalse() don't require synchronization, but it illustrates this point: locking for as short a time as possible, while maybe not always improving your performance, will improve the safety of your code in that it reduces surface area for synchronization problems.
And in certain cases, it will improve performance. Staying locked for long lengths of time means that other threads waiting for access to some resource are going to be waiting longer.
More threads than there are cores typically means that the program is not performing optimally.
So a program which spawns loads of threads is usually not designed in the best fashion. A good example of this practice is the classic socket examples where every incoming connection gets its own thread to handle the connection. It is a very non-scalable way to do things. The more threads there are, the more time the OS has to spend on context switching between threads.
You should first be familiar with Amdahl's law.
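For reference, Amdahl's law says that if a fraction p of the program's work can be parallelized across n cores, the best possible speedup is 1 / ((1 - p) + p / n); for example, with p = 0.9 and n = 8 the speedup is only about 4.7x, and it can never exceed 10x no matter how many cores you add.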
If you are using Java, I recommend the book Java Concurrency in Practice; however, most of its help is specific to the Java language (Java 5 or later).
In general, reducing the amount of shared memory increases the amount of parallelism possible, and for performance that should be a major consideration.
Threading with GUI's is another thing to be aware of, but it looks like it is not relevant for this particular problem.
What kills performance is when two or more threads share the same resources. This could be an object that both use, a file that both use, a network connection that both use, or a processor that both use. You cannot always avoid these dependencies on shared resources, but if possible, try to avoid sharing resources.
Run-time profilers may not work well with a multi-threaded application. Still, anything that makes a single-threaded application slow will also make a multi-threaded application slow. It may be an idea to run your application as a single-threaded application, and use a profiler, to find out where its performance hotspots (bottlenecks) are.
When it's running as a multi-threaded application, you can use the system's performance-monitoring tool to see whether locks are a problem. Assuming that your threads would lock instead of busy-wait, then seeing 100% CPU across several threads is a sign that locking isn't a problem. Conversely, something that looks like 50% total CPU utilization on a dual-processor machine is a sign that only one thread is running, so maybe your locking is a problem that's preventing more than one concurrent thread (when counting the number of CPUs in your machine, beware of multi-core and hyperthreading).
Locks aren't only in your code but also in the APIs you use: e.g. the heap manager (whenever you allocate and delete memory), maybe in your logger implementation, maybe in some of the O/S APIs, etc.
Should I start questioning the locks and looking to a lock-free strategy
I always question the locks, but have never used a lock-free strategy; instead my ambition is to use locks where necessary, so that it's always threadsafe but will never deadlock, and to ensure that locks are acquired for a tiny amount of time (e.g. for no more than the amount of time it takes to push or pop a pointer on a thread-safe queue), so that the maximum amount of time that a thread may be blocked is insignificant compared to the time it spends doing useful work.
You don't mention the language you're using, so I'll make a general statement on locking. Locking is fairly expensive, especially the naive locking that is native to many languages. In many cases you are reading a shared variable (as opposed to writing). Reading is thread-safe as long as it is not taking place simultaneously with a write. However, you still have to lock it down. The most naive form of this locking is to treat the read and the write as the same type of operation, restricting access to the shared variable from other reads as well as writes. A reader/writer lock can dramatically improve performance: one writer, infinite readers. On an app I've worked on, I saw a 35% performance improvement when switching to this construct. If you are working in .NET, the correct lock is the ReaderWriterLockSlim.
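The same construct exists elsewhere too; for example, a minimal C++ sketch (my own example) using std::shared_mutex from C++17:
#include <shared_mutex>
#include <string>
#include <unordered_map>

// Many readers may hold the shared lock at once; a writer takes it exclusively.
// This helps when reads vastly outnumber writes on the shared map.
class Config {
public:
    std::string get(const std::string& key) const {
        std::shared_lock<std::shared_mutex> lock(mutex_);  // shared: readers don't block each other
        auto it = values_.find(key);
        return it == values_.end() ? std::string{} : it->second;
    }

    void set(const std::string& key, const std::string& value) {
        std::unique_lock<std::shared_mutex> lock(mutex_);  // exclusive: blocks readers and writers
        values_[key] = value;
    }

private:
    mutable std::shared_mutex mutex_;
    std::unordered_map<std::string, std::string> values_;
};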
I recommend looking into running multiple processes rather than multiple threads within the same process, if it is a server application.
The benefit of dividing the work between several processes on one machine is that it is easy to increase the number of servers when more performance is needed than a single server can deliver.
You also reduce the risks involved with complex multithreaded applications where deadlocks, bottlenecks etc reduce the total performance.
There are commercial frameworks that simplifies server software development when it comes to load balancing and distributed queue processing, but developing your own load sharing infrastructure is not that complicated compared with what you will encounter in general in a multi-threaded application.
I'm using Delphi 7
You might be using COM objects, then, explicitly or implicitly; if you are, COM objects have their own complications and restrictions on threading: Processes, Threads, and Apartments.
You should first get a tool to monitor threads specific to your language, framework and IDE. Your own logger might do fine too (Resume Time, Sleep Time + Duration). From there you can check for badly performing threads that don't execute much or wait too long for something to happen; you might want to make the event they are waiting for occur as early as possible.
As you want to use both cores you should check the usage of the cores with a tool that can graph the processor usage on both cores for your application only, or just make sure your computer is as idle as possible.
Besides that, you should profile your application just to make sure that the things performed within the threads are efficient, but watch out for premature optimization. There's no sense in optimizing your multiprocessing if the threads themselves are performing badly.
Looking for a lock-free strategy can help a lot, but it is not always possible to get your application to perform in a lock-free way.
Threads don't always equal performance.
Things are a lot better in certain operating systems as opposed to others, but if you can have something sleep or relinquish its time until it's signaled...or not start a new process for virtually everything, you're saving yourself from bogging the application down in context switching.

Resources