I am impressed with Intel Threading Building Blocks. I like how I write tasks rather than thread code, and I like what I understand of how it works under the hood: tasks sit in a pool, so there won't be 100 threads on 4 cores; a task is not guaranteed to run immediately because it isn't on its own thread and may sit far back in the pool; but a task may be run alongside another related task, so you can't get away with the bad habits of typical thread-unsafe code.
I wanted to know more about writing tasks. I like the 'Task-based Multithreading - How to Program for 100 cores' video here: http://www.gdcvault.com/sponsor.php?sponsor_id=1 (currently the second-to-last link; warning, it isn't 'great'). My favourite part was 'solving the maze is better done in parallel', around the 48-minute mark (you can click the link on the left side; that part is really all you need to watch, if any).
However, I'd like to see more code examples and some API showing how to write tasks. Does anyone have a good resource? I have no idea what a class or piece of code looks like once it is pushed onto a pool, how weird the code gets when you need to make a copy of everything, or how much of everything is pushed onto a pool.
Java has a parallel task framework similar to Threading Building Blocks - it's called the Fork-Join framework. It's available for use with the current Java SE 6 and will be included in the upcoming Java SE 7.
There are resources available for getting started with the framework, in addition to the javadoc class documentation. The jsr166 page mentions that
"There is also a wiki containing additional documentation, notes, advice, examples, and so on for these classes."
The fork-join examples, such as matrix multiplication, are a good place to start.
I used the fork-join framework in solving some of Intel's 2009 threading challenges. The framework is lightweight and low-overhead - mine was the only Java entry for the Knight's Tour problem and it outperformed other entries in the competition. The Java sources and write-up are available from the challenge site for download.
EDIT:
I have no idea what a class or piece
of code looks like once it is pushed
onto a pool [...]
You can make your own task by subclassing one of the ForkJoinTask subclasses, such as RecursiveTask. Here's how to compute the Fibonacci sequence in parallel. (Taken from the RecursiveTask javadocs - comments are mine.)
// declare a new task, that itself spawns subtasks.
// The task returns an Integer result.
class Fibonacci extends RecursiveTask<Integer> {
    final int n; // the n'th number in the Fibonacci sequence to compute

    Fibonacci(int n) { this.n = n; } // constructor

    protected Integer compute() { // this method is the main work of the task
        if (n <= 1)   // 1 or 0, base case to end the recursion
            return n;
        Fibonacci f1 = new Fibonacci(n - 1); // create a new task to compute n-1
        f1.fork();                           // schedule it to run asynchronously
        Fibonacci f2 = new Fibonacci(n - 2); // create a new task to compute n-2
        return f2.invoke() + f1.join();      // wait for both tasks to complete.
        // f2 is run as part of this task, f1 runs asynchronously.
        // (You could create two separate tasks and wait for them both, but running
        // f2 as part of this task is a little more efficient.)
    }
}
You then run this task and get the result:
// default parallelism is the number of cores
ForkJoinPool pool = new ForkJoinPool();
Fibonacci f = new Fibonacci(30); // (100 would overflow int and take far too long with this naive recursion)
int result = pool.invoke(f);
This is a trivial example to keep things simple. In practice, performance would not be so good, since the work executed by each task is trivial compared to the overhead of the task framework. As a rule of thumb, a task should perform some significant computation - enough to make the framework overhead insignificant, yet not so much that you end up with one core running one large task at the end of the problem. Splitting large tasks into smaller ones keeps more cores busy and ensures that one core isn't left doing lots of work while the others sit idle - but don't make tasks so small that they do no real work.
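As an illustration of that rule of thumb, here's a minimal sketch (not from the javadocs; the threshold value and the array-summing workload are assumptions made up for the example) of a task that keeps splitting until each piece is small enough, then falls back to a plain sequential loop:

import java.util.concurrent.RecursiveTask;

// Granularity control: split the range in half until it is below THRESHOLD,
// then just loop. The pieces are disjoint slices of a shared, read-only array,
// so nothing is copied and no locking is needed.
class SumTask extends RecursiveTask<Long> {
    static final int THRESHOLD = 10_000; // assumed cut-off - tune it by measuring
    final long[] data;
    final int lo, hi;

    SumTask(long[] data, int lo, int hi) {
        this.data = data; this.lo = lo; this.hi = hi;
    }

    protected Long compute() {
        if (hi - lo <= THRESHOLD) {            // small enough: do the work directly
            long sum = 0;
            for (int i = lo; i < hi; i++) sum += data[i];
            return sum;
        }
        int mid = (lo + hi) >>> 1;             // otherwise split into two subtasks
        SumTask left = new SumTask(data, lo, mid);
        SumTask right = new SumTask(data, mid, hi);
        left.fork();                           // left half runs asynchronously
        return right.compute() + left.join();  // right half runs in this task
    }
}

Raising the threshold amortises the framework overhead over more work per task; lowering it exposes more parallelism. Profiling is the only reliable way to pick it.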
[...] or how weird code may look when
you need to make a copy of everything
and how much of everything is pushed
onto a pool.
Only the tasks themselves are pushed into a pool. Ideally you don't want to be copying anything: to avoid interference and the need for locking, which would slow down your program, your tasks should ideally work with independent data. Read-only data can be shared amongst all tasks and doesn't need to be copied. If threads need to co-operate in building some large data structure, it's best that they build the pieces separately and then combine them at the end. The combining can be done as a separate task, or each task can add its piece of the puzzle to the overall solution. This often does require some form of locking, but it's not a significant performance issue as long as the work of the task is much greater than the work of updating the solution. My Knight's Tour solution takes this approach to update a common repository of tours on the board.
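As a rough sketch of the "each task adds its piece to the overall solution" idea (this is not my Knight's Tour code, just the same shape, with made-up names), the only shared mutable state is a thread-safe queue that tasks append their findings to:

import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.RecursiveAction;

// Each task scans its own slice of a shared, read-only array (nothing is copied)
// and appends any matches to a shared thread-safe queue. The queue's internal
// synchronization is cheap as long as the scanning work dominates.
class FindMatches extends RecursiveAction {
    static final int THRESHOLD = 1_000;            // assumed cut-off
    final int[] data;
    final int lo, hi;
    final ConcurrentLinkedQueue<Integer> results;  // the shared "solution"

    FindMatches(int[] data, int lo, int hi, ConcurrentLinkedQueue<Integer> results) {
        this.data = data; this.lo = lo; this.hi = hi; this.results = results;
    }

    protected void compute() {
        if (hi - lo <= THRESHOLD) {
            for (int i = lo; i < hi; i++)
                if (data[i] % 7 == 0)              // stand-in for a real membership test
                    results.add(data[i]);
            return;
        }
        int mid = (lo + hi) >>> 1;
        invokeAll(new FindMatches(data, lo, mid, results),
                  new FindMatches(data, mid, hi, results));
    }
}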
Working with tasks and concurrency is quite a paradigm shift from regular single-threaded programming. There are often several designs possible to solve a given problem, but only some of these will be suitable for a threaded solution. It can take a few attempts to get a feel for how to recast familiar problems in a multi-threaded way. The best way to learn is to look at the examples and then try it for yourself. Always profile, and measure the effects of varying the number of threads. You can explicitly set the number of threads (cores) to use in the pool constructor. When tasks are broken up linearly, you can expect near-linear speedup as the number of threads increases.
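For example, you can fix the parallelism in the pool constructor and time the same job at several thread counts (reusing the hypothetical SumTask sketch from above; the array size is arbitrary):

// Measure the effect of the thread count: run the same task with 1, 2, 4, ...
// threads and compare wall-clock times.
long[] data = new long[10_000_000];
for (int threads = 1; threads <= Runtime.getRuntime().availableProcessors(); threads *= 2) {
    ForkJoinPool pool = new ForkJoinPool(threads);   // explicit parallelism
    long start = System.nanoTime();
    long sum = pool.invoke(new SumTask(data, 0, data.length));
    System.out.printf("%d threads: sum=%d, %.1f ms%n",
                      threads, sum, (System.nanoTime() - start) / 1e6);
    pool.shutdown();
}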
Playing with "frameworks" which claim to be solving unsolvable (optimal task scheduling is NP hard) is not going to help you at all - reading books and than articles on concurrent algorithms will. So called "tasks" are nothing more that a fancy name for defining separability of the problem (parts that can be computed independently of each other). Class of separable problems is very small - and they are already covered in old books.
For problems which are not separable you have to plan phases and data barriers between phases to exchange data. Optimal orchestration of data barriers for simultaneous data exchange is not just NP-hard but impossible to solve in a general way in principle - you'd need to examine the history of all possible interleavings, which is like taking the power set of an already exponential set (like going from N to R in math). The reason I mention this is to make it clear that no software can ever do it for you, that how to do it depends intrinsically on the actual algorithm, and that it makes or breaks whether parallelization is feasible at all (even if it's theoretically possible).
When you enter high parallelism you can't even maintain a queue; you don't even have a memory bus anymore - imagine 100 CPUs trying to sync up on just a single shared int, or trying to do memory-bus arbitration. You have to pre-plan and pre-configure everything that's going to run and essentially prove correctness on a whiteboard. Intel's Threading Building Blocks are a small kid in that world. They are for a small number of cores which can still share a memory bus. Running separable problems is a no-brainer which you can do without any "framework".
So you are back to having to read about as many different parallel algorithms as you can. It normally takes 1-3 years to research an approximately optimal data-barrier layout for one problem. It becomes a layout problem when you go for, say, 16+ cores on a single chip, since only first neighbors can exchange data efficiently (during one data-barrier cycle). So you'll actually learn much more by looking at CUDA, and at papers and results from IBM's experimental 30-core CPU, than from Intel's sales pitch or some Java toy.
Beware of demo problems for which the size of the resources wasted (number of cores and memory) is much bigger than the speedup they achieve. If it takes 4 cores and 4x the RAM to solve something 2x faster, the solution is not scalable for parallelization.
I have an application that's written in Fortran and there is one particular subroutine call that takes a long time to execute. I was wondering if it's possible to distribute the tasks for computation over multiple nodes.
The current serial flow of the code is as follows:
D = some computations that give me D, held in memory
subroutine call
    <within the subroutine>
    iteration from 1 .. n
    {
        independent operations on D
    }
I wish to distribute the iterations over n/4 machines. Can someone please guide me with this? Do let me know if something's not very clear!
Depending on the underlying implementation, coarrays (F2008) may allow processing to be distributed over multiple nodes. Partitioning the iteration space across the images is relatively straightforward; communicating the results back to one image (or to all images) is where some complexity might arise. Some introductory material on coarrays can be found here.
Again, depending on the underlying implementation, DO CONCURRENT (F2008) may allow parallel processing of iterations (though it is unlikely to be across nodes). Restrictions on what can be done in the scope of a DO CONCURRENT construct mean that iterations can be executed in any order; appropriately capable compilers may then be able to transform that further into concurrent execution.
When one has existing code and wants to parallelize incrementally (or just one routine), shared-memory approaches are the "quick hit". Especially when it is known that the iterations are independent, I'd first recommend looking at compiler flags for auto-parallelization, language constructs such as DO CONCURRENT (thanks to @IanH for reminding me of that), and OpenMP compiler directives.
As my extended comment is about distributed memory, however, I'll come to that.
I'll assume you don't have access to some advanced process-spawning setup on all of your potential machines. That is, you'll have processes running on various machines each being charged for the time regardless of what work is being done. Then, the work-flow looks like
Serial outer loop:
    Calculate D
    Distribute D to the parallel environment
    Inner parallel loop on subsets of D
    Gather D on the master
If the processors/processes in the parallel environment are doing nothing else - or you're being charged regardless - then this is the same to you as
Outer loop:
    All processes calculate D
    Each process works on its subset of D
    Synchronize D
The communication side here - MPI or coarrays (which I'd recommend in this case; again see @IanH's answer, where image synchronization etc. is as limited as a few loops with [..]) - is just the synchronization.
As an endnote: multi-machine coarray support is very limited. ifort, as I understand it, requires a licence beyond the basic one; g95 has some support; the Cray compiler may well support it. That's a separate question, however. MPI would be well supported.
I have a CSV file with over 1 million rows. I also have a database that contains the same data in a structured form.
I want to check and verify the data in the CSV file against the data in the database.
Would it be beneficial (i.e. reduce time) to thread the reading of the CSV file and use a connection pool to the database?
How well does Ruby handle threading?
I am also using MongoDB.
It's hard to say without knowing more about what you want the app to feel like when someone initiates this comparison. So, to answer, here is some general advice that should apply fairly well regardless of the problem you might want to thread.
Threading does NOT make something computationally less costly
Threading doesn't make things less costly in terms of computation time. It just lets two things happen in parallel. So beware of falling into the common misconception that "threading makes my app faster because the user doesn't wait for things" - this isn't true, and threading actually adds quite a bit of complexity.
So, if you kick off this DB vs. CSV comparison task, threading isn't going to make that comparison take any less time. What it might do is allow you to tell the user, "Ok, I'm going to check that for you," right away, while doing the comparison in a separate thread of execution. You still have to figure out how to get back to the user when the comparison is done.
Think about WHY you want to thread, rather than simply approaching it as whether threading is a good solution for long tasks
Like I said above, threading doesn't make things faster. At best, it uses computing resources in a way that is either more efficient, or gives a better user experience, or both.
If the user of the app (maybe it's just you) doesn't mind waiting for the comparison to run, then don't add threading because you're just going to add complexity and it won't be any faster. If this comparison takes a long time and you'd rather "do it in the background" then threading might be an answer for you. Just be aware that if you do this you're then adding another concern, which is, how do you update the user when the background job is done?
Threading involves extra overhead and app complexity, which you will then have to manage within your app - tread lightly
There are other concerns as well, such as: how do I schedule that worker thread to make sure it doesn't hog the computing resources? Is setting thread priorities an option in my environment, and if so, how will adjusting them affect the use of computing resources?
Threading and the extra overhead involved will almost definitely make your comparison take LONGER (in terms of absolute time it takes to do the comparison). The real advantage is if you don't care about completion time (the time between when the comparison starts and when it is done) but instead the responsiveness of the app to the user, and/or the total throughput that can be achieved (e.g. the number of simultaneous comparisons you can be running, and as a result the total number of comparisons you can complete within a given time span).
Threading doesn't guarantee that your available CPU cores are used efficiently
See Green Threads vs. native threads - some languages (depending on their threading implementation) can schedule threads across CPUs.
Threading doesn't necessarily mean your threads wind up getting run in multiple physical CPU cores - in fact in many cases they definitely won't. If all your app's threads run on the same physical core, then they aren't truly running in parallel - they are just splitting CPU time in a way that may make them look like they are running in parallel.
For these reasons, depending on the structure of your app, it's often less complicated to send background tasks to a separate worker process (process, not thread), which can easily be scheduled onto available CPU cores at the OS level. Separate processes (as opposed to separate threads) also remove a lot of the scheduling concerns within your app, because you essentially offload the decision about how to schedule things onto the OS itself.
This last point is pretty important. OS schedulers are extremely likely to be smarter and more efficiently designed than whatever algorithm you might come up with in your app.
I'm creating a multi-threaded application in Linux. Here is the scenario:
Suppose I have x instances of a class BloomFilter and some y GB of data (more than the available memory). I need to test membership of this y GB of data against each of the Bloom filter instances. It is pretty clear that parallel programming will help speed up the task; moreover, since I am only reading the data, it can be shared across all processes or threads.
Now I am confused about which one to use: Cilk, Cilk++, or OpenMP (which one is better)? I am also confused about whether to go for multithreading or multiprocessing.
Cilk Plus is the current implementation of Cilk by Intel.
Both are multithreaded environments, i.e., multiple threads are spawned during execution.
If you are new to parallel programming, OpenMP is probably better for you since it allows easier parallelization of already-developed sequential code. Do you already have a sequential version of your code?
OpenMP uses pragmas to instruct the compiler which portions of the code have to run in parallel. If I understand your problem correctly, you probably need something like this:
#pragma omp parallel for firstprivate(array_of_bloom_filters)
for (long i = 0; i < data_size; ++i) {
    check(data[i], array_of_bloom_filters);
}
The instances of the different Bloom filters are replicated in every thread (the effect of firstprivate) in order to avoid contention, while the data is shared among threads.
Update:
The paper actually considers an application which is very unbalanced, i.e., different tasks (allocated to different threads) may incur very different workloads. Citing the paper you mentioned: "a highly unbalanced task graph that challenges scheduling, load balancing, termination detection, and task coarsening strategies". Consider that in order to balance computation among threads it is necessary to reduce the task size and therefore increase the time spent in synchronization.
In other words, good load balancing always comes at a cost. The description of your problem is not very detailed, but it seems to me that your problem is quite balanced. If this is not the case, then go for Cilk; its work-stealing approach is probably the best solution for unbalanced workloads.
At the time this was posted, Intel was putting a lot of effort into boosting Cilk(tm) Plus; more recently, some effort has been diverted toward OpenMP 4.0.
It's difficult in general to contrast OpenMP with Cilk(tm) Plus.
If it's not possible to distribute work evenly across threads, one would likely set schedule(runtime) in an OpenMP version, and then at run time try various values of the environment variable, such as OMP_SCHEDULE=guided, OMP_SCHEDULE=dynamic,2 or OMP_SCHEDULE=auto. Those are the closest OpenMP analogies to the way Cilk(tm) Plus work stealing works.
Some sparse matrix functions in the Intel MKL library do actually scan the job first and determine how much to allocate to each thread so as to balance the work. For this method to be useful, the time spent in serial scanning and allocating has to be of a lower order than the time spent in parallel work.
Work stealing, or dynamic scheduling, may lose much of OpenMP's potential advantage in promoting cache locality by pinning threads, e.g. with OMP_PROC_BIND=close.
Poor cache locality becomes a bigger issue on a NUMA architecture where it may lead to significant time spent on remote memory access.
Both OpenMP and Cilk(tm) Plus have facilities for switching between serial and parallel execution.
All of the concurrent programs I've seen or heard details of (admittedly a small set) at some point use hardware synchronization features, generally some form of compare-and-swap. The question is: are there any concurrent programs in the wild where the threads interact throughout their lives and get away without any synchronization?
Examples of what I'm thinking of include:
A program that amounts to a single thread running a yes/no test on a large set of cases and a big pile of threads tagging cases based on maybe/no tests. This doesn't need synchronization because dirty data will only affect performance rather than correctness.
A program that has many threads updating a data structure where any state that is valid now will always be valid, so dirty reads or writes don't invalidate anything. An example of this is (I think) path compression in the union-find algorithm.
If you can break work up into completely independent chunks, then yes, there are concurrent algorithms whose only synchronisation point is the one at the end of the work where all threads join. Parallel speedup is then a matter of being able to break the work into tasks whose sizes are as similar as possible.
Some indirect methods for solving systems of linear equations, like Successive over-relaxation ( http://en.wikipedia.org/wiki/Successive_over-relaxation ), don't really need the iterations to be synchronized.
I think it's a bit of a trick question because, e.g., if you program in C, malloc() must be multi-thread safe and uses hardware synchronization, and in Java the garbage collector requires hardware synchronization anyway. All Java programs require the GC, and hardly any C program makes it without malloc() (or C++ program without the new operator).
There is a whole class of algorithms which are sometimes referred to as "embarallel" (contraction of "embarrassingly parallel"). Many image processing algorithms fall into this class, where each pixel may be processed independently (which makes implementation with e.g. SIMD or GPGPU very straightforward).
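For instance, a minimal sketch of the idea (in Java with parallel streams rather than SIMD/GPGPU, purely for illustration): each output pixel depends only on the read-only input, so the worker threads never coordinate until the implicit join when the stream finishes.

import java.util.stream.IntStream;

// Invert every pixel of an image held as a packed-RGB int array. Each index is
// written by exactly one thread, so no locking is needed.
static int[] invert(int[] pixels) {
    int[] out = new int[pixels.length];
    IntStream.range(0, pixels.length)
             .parallel()
             .forEach(i -> out[i] = ~pixels[i] & 0xFFFFFF);
    return out;
}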
Well, without any synchronization at all (even at the end of the algorithm) you obviously can't do anything useful because you can't even transfer the results of concurrent computations to the main thread: suppose that they were on remote machines without any communication channels to the main machine.
The simplest example is inside java.lang.String, which is immutable and lazily caches its hash code. This cache is written to without synchronization because (a) it's cheaper, (b) the value is recomputable, and (c) the JVM guarantees no tearing. The tolerance of data races in purely functional contexts allows tricks like this to be used safely without explicit synchronization.
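A minimal sketch of that idiom (not the actual java.lang.String source, just the same shape): racing threads may each compute the hash, but they all compute the same value, so the unsynchronized write is benign.

// Benign data race: hash may be recomputed by several threads, but int writes
// don't tear and the computed value is always the same, so no locking is needed.
final class Point {
    private final int x, y;     // immutable state
    private int hash;           // 0 means "not computed yet"

    Point(int x, int y) { this.x = x; this.y = y; }

    @Override
    public int hashCode() {
        int h = hash;
        if (h == 0) {           // possibly a stale read; worst case we recompute
            h = 31 * x + y;
            hash = h;           // unsynchronized write of a recomputable value
        }
        return h;
    }
}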
I agree with Mitch's answer. I would like to add that the ray tracing algorithm can work without synchronization until the point where all threads join.
I wrote a C program which reads a dataset from a file and then applies a data mining algorithm to find the clusters and classes in the data. At the moment I am trying to rewrite this sequential program as a multithreaded one with PThreads. I am a newbie to parallel programming and have a question about the number of worker threads that has been puzzling me:
What is the best practice for choosing the number of worker threads in parallel programming, and how do you determine it? Do you try different numbers of threads, look at the results, and then decide, or is there a procedure for finding the optimal number of threads? Of course, I'm looking at this from a performance point of view.
There are a couple of issues here.
As Alex says, the number of threads you can use is application-specific. But there are also constraints that come from the type of problem you are trying to solve. Do your threads need to communicate with one another, or can they all work in isolation on individual parts of the problem? If they need to exchange data, then there will be a maximum number of threads beyond which inter-thread communication will dominate, and you will see no further speed-up (in fact, the code will get slower!). If they don't need to exchange data then threads equal to the number of processors will probably be close to optimal.
Dynamically adjusting the thread pool to the underlying architecture for speed at runtime is not an easy task! You would need a whole lot of additional code to do runtime profiling of your functions. See for example the way FFTW works in parallel. This is certainly possible, but is pretty advanced, and will be hard if you are new to parallel programming. If instead the number of cores estimate is sufficient, then trying to determine this number from the OS at runtime and spawning your threads accordingly will be a much easier job.
To answer your question about technique: Most big parallel codes run on supercomputers with a known architecture and take a long time to run. The best number of processors is not just a function of number, but also of the communication topology (how the processors are linked). They therefore benefit from a testing phase where the best number of processors is determined by measuring the time taken on small problems. This is normally done by hand. If possible, profiling should always be preferred to guessing based on theoretical considerations.
You basically want to have as many ready-to-run threads as you have cores available, or at most 1 or 2 more to ensure no core that's available to you will ever be left idle. The trick is in estimating how many threads will typically be blocked waiting for something else (mostly I/O), as that is totally dependent on your application and even on external entities beyond your control (databases, other distributed services, etc, etc).
In the end, once you've determined about how many threads should be optimal, running benchmarks for thread pool sizes around your estimated value, as you suggest, is good practice (at the very least, it lets you double check your assumptions), especially if, as it appears, you do need to get the last drop of performance out of your system!
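As a rough illustration of that benchmarking loop (sketched in Java rather than pthreads, purely to show the pattern; in C you would get the core count from sysconf(_SC_NPROCESSORS_ONLN) and spawn pthreads instead - the class name and workload below are made up):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ThreadCountBench {
    // Stand-in for the real per-thread work (the clustering step in your case).
    static long busyWork() {
        long sum = 0;
        for (long i = 0; i < 50_000_000L; i++) sum += i;
        return sum;
    }

    public static void main(String[] args) throws Exception {
        int cores = Runtime.getRuntime().availableProcessors();
        // Benchmark pool sizes around the core count, as suggested above.
        for (int size = Math.max(1, cores - 2); size <= cores + 2; size++) {
            List<Callable<Long>> batch = new ArrayList<>();
            for (int t = 0; t < 4 * cores; t++) batch.add(ThreadCountBench::busyWork);
            ExecutorService pool = Executors.newFixedThreadPool(size);
            long start = System.nanoTime();
            pool.invokeAll(batch);               // runs the batch and waits for completion
            pool.shutdown();
            System.out.printf("%d threads: %.1f ms%n", size, (System.nanoTime() - start) / 1e6);
        }
    }
}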