OpenMP: How to get better work balance?

I'm working on a program that has to run a computation foobar on many files. foobar can be done either in parallel or sequentially on one file. The program receives many files (which can be of different sizes!) and applies the computation foobar, either in parallel or sequentially, to each of them with a specified number of threads.
Here is how the program is launched on 8 files with 3 threads:
./program 3 file1 file2 file3 file4 file5 file6 file7 file8
The default scheduling that I've implemented is to assign, in parallel, one thread to each file to do the computation (that's how my program works now!).
Edit: here is the default scheduling that I'm using:
#pragma omp parallel for private(i) schedule(guided,1)
for (i = 0; i < nbre_file; i++)
    foobar(files[i]); // depending on the size of files[i], foobar can behave as a
                      // sequential or a parallel program (which could induce nested loops)
See the image below.
In the image above, the final time is the time spent solving foobar sequentially on the biggest file, file8.
I think that a better scheduling, which would deal effectively with work balance, would be to apply the computation foobar to the big files in parallel, like in the image below, where tr i represents a thread,
such that the final time is the time spent solving foobar in parallel (in the image above we used two threads!) on the biggest file, file8.
My question is: is it possible to do such a scheduling with OpenMP?
Thanks for any reply !

Have you tried dynamic scheduling instead of guided?
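For reference, here is a minimal sketch of that suggestion applied to the loop from the question (nbre_file, files and foobar as defined there). With schedule(dynamic, 1), each idle thread grabs the next unprocessed file, so a thread stuck on a big file no longer delays the distribution of the remaining ones:
#pragma omp parallel for schedule(dynamic, 1)
for (int i = 0; i < nbre_file; i++)
    foobar(files[i]); // the next idle thread picks up the next file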
If the normal scheduling clauses do not work for you, you can parallelize the loop by hand and assign the files to particular threads yourself. Your loop would then look like this:
#pragma omp parallel num_threads(3) // 3 threads, as in the question
{
    int id = omp_get_thread_num(); // private to each thread
    if (id == 0) { // thread 0: all the small files
        for (int i = 0; i < nbre_of_small_files; i++)
            foobar(small_files[i]);
    }
    else { // threads 1 and 2: the big files
        for (int j = 0; j < nbre_of_big_files; j += 2) {
            if (id == 1) { // thread 1 takes the even indices
                foobar(big_files[j]);
            }
            else if (j + 1 < nbre_of_big_files) { // thread 2 takes the odd ones
                foobar(big_files[j + 1]);
            }
        }
    }
}
Here, thread 0 does all the small files, while threads 1 and 2 share the big files.
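If the goal is to also run foobar itself in parallel on the big files (the second image in the question), nested parallelism can do this once it is enabled. Below is a minimal sketch, not the asker's actual code: file_is_big and work_items are hypothetical helpers standing in for whatever size test and inner loop foobar really contains:
#include <omp.h>

int  file_is_big(const char *file);  // hypothetical: size test
long work_items(const char *file);   // hypothetical: per-file work count

void foobar(const char *file)
{
    // inner region: only worth the extra threads on a big file
    #pragma omp parallel for num_threads(2) if (file_is_big(file))
    for (long i = 0; i < work_items(file); i++) {
        /* ... compute on item i of the file ... */
    }
}

int main(int argc, char **argv)
{
    omp_set_nested(1); // enable nested parallel regions (OpenMP 3.x API)
    #pragma omp parallel for schedule(dynamic, 1)
    for (int i = 2; i < argc; i++) // argv[1] is the thread count in the question's invocation
        foobar(argv[i]);
    return 0;
}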

Related

Split a loop into smaller parts

I have a function inside a loop that takes a lot of resources and usually does not get completed unless the server is on low load.
How can I split it into smaller loops? When I decrease the value, the function executes fine.
As an example, this works:
x = 10
for i = 0; i <= x; i++
{
myfunction(i)
}
However, when I increase x to 100, the memory-hogging function stops working.
How can one split 100 into chunks of 10 (for example) and run the loop 10 times?
I should be able to use any value, not only 100 or multiples of 10.
Thank you.
Is your function asynchronous, with too many instances running at once? Is it opening resources and not closing them? Perhaps you could put in a delay after every 10 iterations:
int x = 1000;
for (int i = 0; i <= x; i++)
{
    myfunction(i);
    if (i % 10 == 0)
    {
        Thread.Sleep(1000); // pause briefly after every 10th iteration
    }
}
If your task is asynchronous, you can use the worker thread pool technique. For example: create a thread pool with 10 threads and assign the first 10 tasks to them; whenever a task finishes, assign one of the remaining tasks to the freed thread.
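In the spirit of the OpenMP examples elsewhere on this page, here is a sketch of that pool idea in C (myfunction is the asker's placeholder): 10 worker threads share the iterations, and schedule(dynamic, 10) hands a fresh chunk of 10 to whichever worker finishes first, so at most 10 instances run at once:
#include <omp.h>

void myfunction(int i); // the asker's placeholder

void run_in_chunks(int x)
{
    // a fixed pool of 10 workers; chunks of 10 iterations are handed out on demand
    #pragma omp parallel for schedule(dynamic, 10) num_threads(10)
    for (int i = 0; i <= x; i++)
        myfunction(i);
}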

Difference between OMP_NUM_THREADS and OMP_THREAD_LIMIT [duplicate]

Heluuu,
I have a rather large program that I'm attempting to thread. So far, this has been successful, and the basics are all working as intended.
I now want to do some fancy work with cascading threads in nested mode. Essentially, I want the main parallel region to use any free threads in lower parallel regions.
To detail the current system: the main parallel region starts 10 threads. I have 12 cores, so I can use 2 more threads. There is a second parallel region where some heavy computing happens, and I want the first two threads that reach this point to each start a new team there with 2 threads. Every entry into the lower parallel region after that will continue serially.
So, this should look like the following.
Main region: 10 threads started.
Lower region: 2 new threads started.
Thread 1: 2 threads in lower region.
Thread 2: 2 threads in lower region.
Thread 3-10: 1 thread in lower region.
Please keep in mind that these numbers are for the sake of clarity in providing a concrete description of my situation, and not the absolute and only case in which the program operates.
The code:
int main() {
    ...
    ...
    omp_set_num_threads(n);
    omp_set_dynamic(x);
    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < iterations; i++) {
            ...
            Compute();
            ...
        }
    }
}
And in Compute
bool Compute() {
    ...
    float nThreads = omp_get_thread_limit() - omp_get_num_threads();
    nThreads = ceil(nThreads / omp_get_num_threads());
    omp_set_num_threads((int)nThreads);
    #pragma omp parallel
    {
        ...
        #pragma omp for
        for (int i = 0; i < nReductSize; i++) {
            ...
        }
    }
}
Now, my problem is that setting the uppermost limit for the whole program (i.e. OMP_THREAD_LIMIT) only works from outside the program. Using
export OMP_THREAD_LIMIT=5
from the bash command line works great. But I want to do it internally. So far, I've tried
putenv("OMP_THREAD_LIMIT=12");
setenv("OMP_THREAD_LIMIT", "12", 1);
but when I call omp_get_thread_limit() or getenv("OMP_THREAD_LIMIT") I get wacky return values. Even when I set the variable with export, calling getenv("OMP_THREAD_LIMIT"); returns 0.
So, I would ask for your help in this: How do I properly set OMP_THREAD_LIMIT at runtime?
This is the main function where I set the thread defaults. It is executed well before any threading occurs:
#ifdef _OPENMP
const char *name = "OMP_THREAD_LIMIT";
const char *value = "5";
int overwrite = 1;
int success = setenv(name, value, overwrite);
cout << "Var set (0 is success): " << success << endl;
#endif
Oh, and setenv reports success in setting the variable.
Compiler says
gcc44 (GCC) 4.4.7 20120313 (Red Hat 4.4.7-1)
Flags
CCFLAGS = -c -O0 -fopenmp -g -msse -msse2 -msse3 -mfpmath=sse -std=c++0x
OpenMP version is 3.0.
This is a correct OpenMP implementation: it ignores changes made to the environment from inside the program. As stated in the OpenMP 3.1 standard, page 159:
Modifications to the environment variables after the program has started, even if
modified by the program itself, are ignored by the OpenMP implementation.
You are doing exactly what is said in this paragraph.
OpenMP allows changing such parameters only via the omp_set_* functions, but there is no such function for the thread-limit-var ICV:
However, the settings of some of the ICVs can be modified during the execution of the OpenMP
program by the use of the appropriate directive clauses or OpenMP API routines.
I think you may use the num_threads clause of #pragma omp parallel to achieve what you want.
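For illustration, a minimal sketch of that suggestion, reusing the question's figures (10 outer threads, inner teams of 2) purely as example numbers:
#include <omp.h>

int main(void)
{
    omp_set_nested(1); // the inner region needs nesting enabled to get its own team
    #pragma omp parallel num_threads(10)
    {
        /* ... outer work ... */
        #pragma omp parallel num_threads(2) // cap the inner team explicitly
        {
            /* ... inner loop ... */
        }
    }
    return 0;
}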
Changing the behavior of OpenMP using OMP_THREAD_LIMIT (or any other OMP_* environment variable) is not possible after the program has started; these are intended for use by the user. You could have the user invoke your program through a script that sets OMP_THREAD_LIMIT and then calls your program, but that's probably not what you need to do in this case.
OMP_NUM_THREADS, omp_set_num_threads, and the num_threads clause are usually used to set the number of threads operating in a region.
It might be off-topic, but you may want to try OpenMP's collapse clause instead of handcrafting the nesting here.
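A sketch of the collapse suggestion, with hypothetical bounds n, m and a placeholder work function: if the outer and inner loops are perfectly nested, collapse(2) merges the two iteration spaces so the runtime distributes all n*m iterations across one team, instead of managing nested teams by hand:
#pragma omp parallel for collapse(2)
for (int i = 0; i < n; i++)
    for (int j = 0; j < m; j++)
        work(i, j); // one flat pool of n*m iterations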

Perl fast matrix multiply

I have implemented the following statistical computation in Perl: http://en.wikipedia.org/wiki/Fisher_information.
The results are correct; I know this because I have hundreds of test cases that match input and output. The problem is that I need to compute this many times, every single time I run the script. The average number of calls to this function is around 530. I used Devel::NYTProf to find this out, as well as where the slow parts are. I have optimized the algorithm to only traverse the top half of the matrix and reflect it onto the bottom, as the two halves are the same. I'm not a Perl expert, but I need to know if there is anything I can try to speed up the Perl. This script is distributed to clients, so compiling a C file is not an option. Is there another Perl library I can try? This needs to be sub-second in speed, if possible.
More information: $MatrixRef is a matrix of floating point numbers that is $rows by $variables. Here is the NYTProf dump for the function:
#-----------------------------------------------
#
#-----------------------------------------------
sub ComputeXpX
# spent 4.27s within ComputeXpX which was called 526 times, avg 8.13ms/call:
# 526 times (4.27s+0s) by ComputeEfficiency at line 7121, avg 8.13ms/call
{
526 0s my ($MatrixRef, $rows, $variables) = @_;
526 0s my $r = 0;
526 0s my $c = 0;
526 0s my $k = 0;
526 0s my $sum = 0;
526 0s my @xpx = ();
526 11.0ms for ($r = 0; $r < $variables; $r++)
{
14202 19.0ms my @temp = (0) x $variables;
14202 6.01ms push(@xpx, \@temp);
526 0s }
526 7.01ms for ($r = 0; $r < $variables; $r++)
{
14202 144ms for ($c = $r; $c < $variables; $c++)
{
198828 43.0ms $sum = 0;
#for ($k = 0; $k < $rows; $k++)
198828 101ms foreach my $RowRef (@{$MatrixRef})
{
#$sum += $MatrixRef->[$k]->[$r]*$MatrixRef->[$k]->[$c];
6362496 3.77s $sum += $RowRef->[$r]*$RowRef->[$c];
}
198828 80.1ms $xpx[$r]->[$c] = $sum;
#reflect on other side of matrix
198828 82.1ms $xpx[$c]->[$r] = $sum if ($r != $c);
14202 1.00ms }
526 2.00ms }
526 2.00ms return \@xpx;
}
Since each element of the result matrix can be calculated independently, it should be possible to calculate some/all of them in parallel. In other words, none of the instances of the innermost loop depend on the results of any other, so they could run simultaneously on their own threads.
There really isn't much you can do here without rewriting parts in C or moving to a framework better suited to mathematical operations than bare-bones Perl (→ PDL!).
Some minor optimization ideas:
You initialize @xpx with arrayrefs containing zeros. This is unnecessary, as you assign a value to every position anyway. If you want to pre-allocate array space, assign to the $#array value:
my @array;
$#array = 100; # preallocate space for 101 scalars
This isn't generally useful, but you can benchmark with and without.
Iterate over ranges; don't use C-style for loops:
for my $c ($r .. $variables - 1) { ... }
Perl scalars aren't very fast for math operations, so offloading the range iteration to lower levels will gain a speedup.
Experiment with changing the order of the loops, and toy around with caching a level of array accesses. Keeping my $xpx_r = $xpx[$r] around in a scalar will reduce the number of array accesses. If your input is large enough, this translates into a speed gain. Note that this only works when the cached value is a reference.
Remember that perl does very few “big” optimizations, and that the opcode tree produced by compilation closely resembles your source code.
Edit: On threading
Perl threads are heavyweight beasts that literally clone the current interpreter. It is very much like forking.
Sharing data structures across thread boundaries is possible (use threads::shared; my $variable :shared = "foo") but there are various pitfalls. It is cleaner to pass data around in a Thread::Queue.
Splitting the calculation of one product over multiple threads could end up with your threads doing more communication than calculation. You could benchmark a solution that divides responsibility for certain rows between the threads. But I think recombining the solutions efficiently would be difficult here.
More likely to be useful is to have a bunch of worker threads running from the beginning. All threads listen to a queue which contains a pair of a matrix and a return queue. The worker would then dequeue a problem, and send back the solution. Multiple calculations could be run in parallel, but a single matrix multiplication will be slower. Your other code would have to be refactored significantly to take advantage of the parallelism.
Untested code:
use strict; use warnings; use threads; use Thread::Queue;

# spawn worker threads:
my $problem_queue = Thread::Queue->new;
my @threads = map threads->new(\&worker, $problem_queue), 1..3; # make 3 workers

# automatically close threads when program exits
END {
    $problem_queue->enqueue((undef) x @threads);
    $_->join for @threads;
}

# This is the wrapper around the threading,
# and can be called exactly as ComputeXpX
sub async_XpX {
    my $return_queue = Thread::Queue->new();
    $problem_queue->enqueue([$return_queue, @_]);
    return sub { $return_queue->dequeue };
}

# The main loop of worker threads
sub worker {
    my ($queue) = @_;
    while (defined(my $problem = $queue->dequeue)) {
        my ($return, @args) = @$problem;
        $return->enqueue(ComputeXpX(@args));
    }
}

sub ComputeXpX { ... } # as before
async_XpX returns a coderef that will eventually collect the result of the computation. This allows us to carry on with other stuff until we need the result.
# start two calculations
my $future1 = async_XpX(...);
my $future2 = async_XpX(...);
...; # do something else
# collect the results
my $result1 = $future1->();
my $result2 = $future2->();
I benchmarked the bare-bones threading code without doing actual calculations, and the communication is about as expensive as the calculations. I.e. with a bit of luck, you may start to get a benefit on a machine with at least four processors/kernel threads.
A note on profiling threaded code: I know of no way to do that elegantly. It may be preferable to benchmark the threaded code, but profile with single-threaded test cases.

How are firstprivate and lastprivate different than private clauses in OpenMP?

I've looked at the official definitions, but I'm still quite confused.
firstprivate: Specifies that each thread should have its own instance of a variable, and that the variable should be initialized with the value of the variable, because it exists before the parallel construct.
To me, that sounds a lot like private. I've looked for examples, but I don't seem to understand how it's special or how it can be used.
lastprivate: Specifies that the enclosing context's version of the variable is set equal to the private version of whichever thread executes the final iteration (for-loop construct) or last section (#pragma sections).
I feel like I understand this one a bit better because of the following example:
#pragma omp parallel
{
    #pragma omp for lastprivate(i)
    for (i = 0; i < n - 1; i++)
        a[i] = b[i] + b[i+1];
}
a[i] = b[i];
So, in this example, I understand that lastprivate allows i to keep, outside the loop, the last value it had inside it.
I just started learning OpenMP today.
private variables are not initialised, i.e. they start with random values like any other local automatic variable (and they are often implemented using automatic variables on the stack of each thread). Take this simple program as an example:
#include <stdio.h>
#include <omp.h>

int main (void)
{
    int i = 10;

    #pragma omp parallel private(i)
    {
        printf("thread %d: i = %d\n", omp_get_thread_num(), i);
        i = 1000 + omp_get_thread_num();
    }

    printf("i = %d\n", i);

    return 0;
}
With four threads it outputs something like:
thread 0: i = 0
thread 3: i = 32717
thread 1: i = 32717
thread 2: i = 1
i = 10
(another run of the same program)
thread 2: i = 1
thread 1: i = 1
thread 0: i = 0
thread 3: i = 32657
i = 10
This clearly demonstrates that the value of i is random (not initialised) inside the parallel region and that any modifications to it are not visible after the parallel region (i.e. the variable keeps its value from before entering the region).
If i is made firstprivate, then it is initialised with the value that it has before the parallel region:
thread 2: i = 10
thread 0: i = 10
thread 3: i = 10
thread 1: i = 10
i = 10
Still modifications to the value of i inside the parallel region are not visible after it.
You already know about lastprivate (and it is not applicable to the simple demonstration program as it lacks worksharing constructs).
So yes, firstprivate and lastprivate are just special cases of private. The first one results in bringing in values from the outside context into the parallel region while the second one transfers values from the parallel region to the outside context. The rationale behind these data-sharing classes is that inside the parallel region all private variables shadow the ones from the outside context, i.e. it is not possible to use an assignment operation to modify the outside value of i from inside the parallel region.
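To round out the demonstration, here is a minimal lastprivate example (a sketch, not from the original answer) with a worksharing loop; after the construct, the loop variable has the value it would have after sequential execution:
#include <stdio.h>

int main (void)
{
    int i = 10;

    #pragma omp parallel for lastprivate(i)
    for (i = 0; i < 100; i++) {
        /* each thread iterates with its own private i */
    }

    printf("i = %d\n", i); // prints 100, as after a sequential run of the loop

    return 0;
}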
Note that you cannot safely use a local variable i before initialization: reading it is undefined behaviour (the C++14 standard tightened the rules on indeterminate values).

How many tasks in parallel

I built a sample program to check the performance of tasks in parallel, with respect to the number of tasks running in parallel.
A few assumptions:
The operation on one thread is independent of the other threads, so no synchronization mechanisms between threads are needed.
The idea is to check whether it is more efficient to:
1. spawn as many tasks as possible, or
2. restrict the number of tasks in parallel, and wait for some tasks to complete before spawning the remaining ones.
Following is the program:
static void Main(string[] args)
{
    System.IO.StreamWriter writer = new System.IO.StreamWriter("C:\\TimeLogV2.csv");
    SemaphoreSlim availableSlots;
    for (int slots = 10; slots <= 20000; slots += 10)
    {
        availableSlots = new SemaphoreSlim(slots, slots);
        int maxTasks;
        CountdownEvent countDownEvent;
        Stopwatch watch = new Stopwatch();
        watch.Start();
        maxTasks = 20000;
        countDownEvent = new CountdownEvent(maxTasks);
        for (int i = 0; i < maxTasks; i++)
        {
            Console.WriteLine(i);
            Task task = new Task(() => Thread.Sleep(50));
            task.ContinueWith((t) =>
            {
                availableSlots.Release();
                countDownEvent.Signal();
            });
            availableSlots.Wait();
            task.Start();
        }
        countDownEvent.Wait();
        watch.Stop();
        writer.WriteLine("{0},{1}", slots, watch.ElapsedMilliseconds);
        Console.WriteLine("{0}:{1}", slots, watch.ElapsedMilliseconds);
    }
    writer.Flush();
    writer.Close();
}
Here are the results:
The Y-axis is the time taken in milliseconds; the X-axis is the number of semaphore slots (refer to the program above).
Essentially the trend is: the more parallel tasks, the better. Now my question is: under what conditions does a higher number of parallel tasks become less optimal (take more time)?
One condition I suppose is that the tasks are interdependent and may have to wait for certain resources to become available.
Have you, in any scenario, limited the number of parallel tasks?
The TPL will control how many threads are running at once - basically you're just queuing up tasks to be run on those threads. You're not really running all those tasks in parallel.
The TPL will use work-stealing queues to make it all as efficient as possible. If you have all the information about what tasks you need to run, you might as well queue them all to start with, rather than trying to micro-manage it yourself. Of course, this will take memory - so that might be a concern, if you have a huge number of tasks.
I wouldn't try to artificially break your logical tasks into little bits just to get more tasks, however. You shouldn't view "more tasks == better" as a general rule.
(I note that you're including time taken to write a lot of lines to the console in your measurements, by the way. I'd remove those Console.WriteLine calls and try again - they may well make a big difference.)
