How many tasks in parallel

I built a sample program to check the performance of tasks in parallel, with respect to the number of tasks running in parallel.
A few assumptions:
The work on each thread is independent of the other threads, so no synchronization mechanisms between threads are needed.
The idea is to check whether it is more efficient to:
1. Spawn as many tasks as possible, or
2. Restrict the number of tasks in parallel, and wait for some tasks to complete before spawning the remaining ones.
Following is the program:
static void Main(string[] args)
{
    System.IO.StreamWriter writer = new System.IO.StreamWriter("C:\\TimeLogV2.csv");
    SemaphoreSlim availableSlots;
    for (int slots = 10; slots <= 20000; slots += 10)
    {
        availableSlots = new SemaphoreSlim(slots, slots);
        int maxTasks;
        CountdownEvent countDownEvent;
        Stopwatch watch = new Stopwatch();
        watch.Start();
        maxTasks = 20000;
        countDownEvent = new CountdownEvent(maxTasks);
        for (int i = 0; i < maxTasks; i++)
        {
            Console.WriteLine(i);
            Task task = new Task(() => Thread.Sleep(50));
            task.ContinueWith((t) =>
            {
                availableSlots.Release();
                countDownEvent.Signal();
            });
            availableSlots.Wait();
            task.Start();
        }
        countDownEvent.Wait();
        watch.Stop();
        writer.WriteLine("{0},{1}", slots, watch.ElapsedMilliseconds);
        Console.WriteLine("{0}:{1}", slots, watch.ElapsedMilliseconds);
    }
    writer.Flush();
    writer.Close();
}
Here are the results:
The Y-axis is the time taken in milliseconds, the X-axis is the number of semaphore slots (refer to the program above).
Essentially the trend is: the more parallel tasks, the better. Now my question is: under what conditions does
more parallel tasks = less optimal (longer time taken)?
One condition I suppose is that:
the tasks are interdependent, and may have to wait for certain resources to become available.
Have you, in any scenario, limited the number of parallel tasks?

The TPL will control how many threads are running at once - basically you're just queuing up tasks to be run on those threads. You're not really running all those tasks in parallel.
The TPL will use work-stealing queues to make it all as efficient as possible. If you have all the information about what tasks you need to run, you might as well queue them all to start with, rather than trying to micro-manage it yourself. Of course, this will take memory - so that might be a concern, if you have a huge number of tasks.
I wouldn't try to artificially break your logical tasks into little bits just to get more tasks, however. You shouldn't view "more tasks == better" as a general rule.
(I note that you're including time taken to write a lot of lines to the console in your measurements, by the way. I'd remove those Console.WriteLine calls and try again - they may well make a big difference.)
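To illustrate the "queue everything up front and let the pool schedule it" idea, here is a rough sketch in Java rather than C# (ExecutorService stands in for the TPL's scheduler, and doWork is just a placeholder for the 50 ms job); it only shows the shape of the approach, not the original program:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class QueueAllTasks {
    public static void main(String[] args) throws Exception {
        int maxTasks = 20_000;
        // The pool size, not the number of queued tasks, bounds how many run at once.
        ExecutorService pool = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        List<Future<?>> futures = new ArrayList<>(maxTasks);
        long start = System.nanoTime();
        for (int i = 0; i < maxTasks; i++) {
            futures.add(pool.submit(() -> doWork())); // just queue the work, no throttling semaphore
        }
        for (Future<?> f : futures) {
            f.get(); // wait for everything to complete
        }
        System.out.println((System.nanoTime() - start) / 1_000_000 + " ms");
        pool.shutdown();
    }

    private static void doWork() {
        try { Thread.sleep(50); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}

The memory cost of queuing everything up front is the list of futures plus the pool's work queue, which is the trade-off mentioned above.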

Related

Split a loop into smaller parts

I have a function inside a loop that takes a lot of resources and usually does not get completed unless the server is under low load.
How can I split it into smaller loops? When I decrease the value, the function executes fine.
As an example, this works:
x = 10
for i = 0; i <= x; i++
{
    myfunction(i)
}
However, when increasing x to 100, the memory-hogging function stops working.
How can one split 100 into chunks of 10 (for example) and run the loop 10 times?
I should be able to use any value, not only 100 or multiples of 10.
Thank you.
Is your function asynchronous and are you ending up with too many instances at once? Is it opening resources and not closing them? Perhaps you could put in a delay after every 10 iterations:
x = 1000
for i = 0; i <= x; i++
{
    myfunction(i);
    if (i % 10 == 0)
    {
        Thread.Sleep(1000);
    }
}
If your task is asynchronous, you can use the worker thread pool technique.
For example, you can create a thread pool with 10 threads and assign the first 10 tasks to them.
Whenever a task finishes, you assign one of the remaining tasks to the freed thread, as in the sketch below.
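A minimal sketch of that worker-pool idea in Java (myfunction here is just a placeholder for the heavy call from the question): all items are queued up front, but at most 10 run at the same time, and each worker picks up the next item as soon as it finishes one.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ChunkedWork {
    public static void main(String[] args) throws InterruptedException {
        int x = 100;
        ExecutorService pool = Executors.newFixedThreadPool(10); // at most 10 items in flight
        for (int i = 0; i <= x; i++) {
            final int item = i;
            pool.submit(() -> myfunction(item)); // queued; runs when a worker is free
        }
        pool.shutdown();                          // no new work will be accepted
        pool.awaitTermination(1, TimeUnit.HOURS); // wait for the queue to drain
    }

    // stands in for the resource-heavy function from the question
    private static void myfunction(int i) {
        System.out.println("processing " + i);
    }
}

This works for any value of x, not only multiples of 10, because the pool rather than the loop structure limits how much runs concurrently.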

Parallel Stream non-concurrent unordered collector

Suppose I have this custom collector:
public class CustomToListCollector<T> implements Collector<T, List<T>, List<T>> {

    @Override
    public Supplier<List<T>> supplier() {
        return ArrayList::new;
    }

    @Override
    public BiConsumer<List<T>, T> accumulator() {
        return List::add;
    }

    @Override
    public BinaryOperator<List<T>> combiner() {
        return (l1, l2) -> {
            l1.addAll(l2);
            return l1;
        };
    }

    @Override
    public Function<List<T>, List<T>> finisher() {
        return Function.identity();
    }

    @Override
    public Set<java.util.stream.Collector.Characteristics> characteristics() {
        return EnumSet.of(Characteristics.IDENTITY_FINISH, Characteristics.UNORDERED);
    }
}
This is exactly the Collectors#toList implementation, with one minor difference: the UNORDERED characteristic is also added.
I would assume that running this code:
List<Integer> list = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8);
for (int i = 0; i < 100_000; i++) {
    List<Integer> result = list.parallelStream().collect(new CustomToListCollector<>());
    if (!result.equals(list)) {
        System.out.println(result);
        break;
    }
}
should actually produce some result. But it does not.
I've looked under the hood a bit. ReferencePipeline#collect first checks whether the stream is parallel, whether the collector is concurrent, and whether the collector is unordered. CONCURRENT is missing, so it delegates to the evaluate method, creating a TerminalOp out of this collector. Under the hood this is a ReducingSink that actually cares whether the collector is unordered or not:
return new ReduceOp<T, I, ReducingSink>(StreamShape.REFERENCE) {
    @Override
    public ReducingSink makeSink() {
        return new ReducingSink();
    }

    @Override
    public int getOpFlags() {
        return collector.characteristics().contains(Collector.Characteristics.UNORDERED)
                ? StreamOpFlag.NOT_ORDERED
                : 0;
    }
};
I have not debugged further, since it gets pretty complicated fast.
So maybe there is a shortcut here and someone could explain what I am missing. It is a parallel stream that collects elements into a non-concurrent unordered collector. Shouldn't there be no order in how the threads combine their results? If not, how (and by whom) is the order imposed here?
Note that the result is the same when using list.parallelStream().unordered().collect(Collectors.toList()); in either case, the unordered property is not used within the current implementation.
But let’s change the setup a little bit:
List<Integer> list = Collections.nCopies(10, null).stream()
        .flatMap(ig -> IntStream.range(0, 100).boxed())
        .collect(Collectors.toList());
List<Integer> reference = new ArrayList<>(new LinkedHashSet<>(list));
for (int i = 0; i < 100_000; i++) {
    List<Integer> result = list.parallelStream()
            .distinct()
            .collect(characteristics(Collectors.toList(), Collector.Characteristics.UNORDERED));
    if (!result.equals(reference)) {
        System.out.println(result);
        break;
    }
}
using the characteristics collector factory of this answer
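The linked factory itself is not reproduced here; as a rough sketch under the assumption that it only merges characteristics, it could look like this (the class name Collectors2 is arbitrary): it wraps an existing collector and reports the union of that collector's characteristics and the requested ones.

import java.util.Arrays;
import java.util.EnumSet;
import java.util.Set;
import java.util.stream.Collector;

final class Collectors2 {
    // Wraps a collector and reports the union of its characteristics and the given ones.
    static <T, A, R> Collector<T, A, R> characteristics(Collector<T, A, R> original,
                                                        Collector.Characteristics... toAdd) {
        Set<Collector.Characteristics> merged = EnumSet.noneOf(Collector.Characteristics.class);
        merged.addAll(original.characteristics());
        merged.addAll(Arrays.asList(toAdd));
        return Collector.of(original.supplier(), original.accumulator(), original.combiner(),
                original.finisher(), merged.toArray(new Collector.Characteristics[0]));
    }
}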
The interesting thing is that in Java 8 versions prior to 1.8.0_60, this has a different outcome. If we use objects with distinct identities instead of the canonical Integer instances, we can see that in these earlier versions not only does the order of the list differ, but the objects in the result list are not the first encountered instances.
So the unordered characteristic of a terminal operation was propagated to the stream, affecting the behavior of distinct(), similar to that of skip and limit, as discussed here and here.
As discussed in the second linked thread, the back-propagation has been removed completely, which is reasonable when thinking about it a second time. For distinct, skip and limit, the order of the source is relevant and ignoring it just because the order will be ignored in subsequent stages is not right. So the only remaining stateful intermediate operation that could benefit from back-propagation would be sorted, which would be rendered obsolete when the order is being ignored afterwards. But combining sorted with an unordered sink is more like a programming error anyway…
For stateless intermediate operations the order is irrelevant anyway. The stream processing works by splitting the source into chunks, apply all stateless intermediate operations on their elements independently and collecting into a local container, before merging into the result container. So the merging step is the only place, where respecting or ignoring the order (of the chunks) will have an impact on the result and perhaps on the performance.
But the impact isn’t very big. When you implement such an operation, e.g. via ForkJoinTasks, you simply split a task into two, wait for their completion and merge them. Alternatively, a task may split off a chunk into a sub-task, process its remaining chunk in-place, wait for the sub-task and merge. In either case, merging the results in order comes naturally due to the fact that the initiating task has hands on references to the adjacent tasks. To merge with different chunks instead, the associated sub-tasks first have to be found somehow.
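As a simplified illustration of that in-order merge (a sketch, not the JDK's actual implementation): each task splits its range, forks one half, computes the other half in-place, and then joins and concatenates left before right, so the encounter order falls out for free.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;
import java.util.function.IntFunction;

class CollectRange extends RecursiveTask<List<Integer>> {
    private final int from, to;
    private final IntFunction<Integer> mapper; // stands in for a stateless intermediate operation

    CollectRange(int from, int to, IntFunction<Integer> mapper) {
        this.from = from; this.to = to; this.mapper = mapper;
    }

    @Override
    protected List<Integer> compute() {
        if (to - from <= 1_000) { // small enough: process locally into a local container
            List<Integer> local = new ArrayList<>(to - from);
            for (int i = from; i < to; i++) local.add(mapper.apply(i));
            return local;
        }
        int mid = (from + to) >>> 1;
        CollectRange left = new CollectRange(from, mid, mapper);
        CollectRange right = new CollectRange(mid, to, mapper);
        left.fork();                                 // process the left half asynchronously
        List<Integer> rightResult = right.compute(); // process the right half in this thread
        List<Integer> result = left.join();          // the initiating task holds both references,
        result.addAll(rightResult);                  // so merging left-then-right keeps the order
        return result;
    }

    public static void main(String[] args) {
        List<Integer> squares = ForkJoinPool.commonPool().invoke(new CollectRange(0, 10_000, i -> i * i));
        System.out.println(squares.size() + " results, in encounter order");
    }
}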
The only benefit from merging with a different task would be that you can merge with the first completed task, if the tasks need different time to complete. But when waiting for a sub-task in the Fork/Join framework, the thread won’t be idle, the framework will use the thread for working on other pending tasks in-between. So as long as the main task has been split into enough sub-tasks, there will be full CPU utilization. Also, the spliterators attempt to split into even chunks to reduce the differences between the computing times. It’s very likely that the benefit of an alternative unordered merging implementation doesn’t justify the code duplication, at least with the current implementation.
Still, reporting an unordered characteristic allows the implementation to utilize it when beneficial and implementations can change.
This is not an actual answer per se, but if I add more code and comments it would get too long, I guess.
Here is another interesting thing; it actually made me realize I was wrong in the comments.
The spliterator's flags need to be merged with the flags of the terminal operation and of all intermediate operations.
Our spliterator's flags (as reported by StreamOpFlag) are 95; this can be seen by debugging into AbstractPipeline#sourceSpliterator(int terminalFlags).
That is why the line below reports true:
System.out.println(StreamOpFlag.ORDERED.isKnown(95)); // true
At the same time our terminal collector's characteristics are 32:
System.out.println(StreamOpFlag.ORDERED.isKnown(32)); // false
The result:
int result = StreamOpFlag.combineOpFlags(32, 95); // 111
System.out.println(StreamOpFlag.ORDERED.isKnown(result)); // false
If you think about it, this makes complete sense: the List has order, my custom collector does not => order is not preserved.
Bottom line: the UNORDERED flag is preserved in the resulting Stream, but internally nothing is done with it. They probably could, but they chose not to.
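For contrast (this is an illustrative addition, not part of the answer above): the characteristic does take effect in the current implementation when the collector is also CONCURRENT. ReferencePipeline#collect, mentioned in the question, only short-circuits into a single shared container when the stream is parallel, the collector is CONCURRENT, and either the stream or the collector is UNORDERED; the JDK's own groupingByConcurrent is exactly such a collector:

import java.util.List;
import java.util.concurrent.ConcurrentMap;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class ConcurrentCollectDemo {
    public static void main(String[] args) {
        List<Integer> list = IntStream.range(0, 1_000).boxed().collect(Collectors.toList());
        // CONCURRENT + UNORDERED: all threads accumulate into one ConcurrentMap,
        // no per-chunk containers are merged and no encounter order is preserved.
        ConcurrentMap<Integer, Long> counts = list.parallelStream()
                .collect(Collectors.groupingByConcurrent(i -> i % 3, Collectors.counting()));
        System.out.println(counts); // e.g. {0=334, 1=333, 2=333}
    }
}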

Use of pthread increases execution time, suggestions for improvements

I had a piece of code which looked like this:
for(i=0;i<NumberOfSteps;i++)
{
    for(k=0;k<NumOfNodes;k++)
    {
        mark[crawler[k]]++;
        r = rand() % node_info[crawler[k]].num_of_nodes;
        crawler[k] = (int)DataBlock[node_info[crawler[k]].index+r][0];
    }
}
I changed it so that the load can be split among multiple threads. Now it looks like this:
for(i=0;i<NumberOfSteps;i++)
{
    for(k=0;k<NumOfNodes;k++)
    {
        pthread_mutex_lock( &mutex1 );
        mark[crawler[k]]++;
        pthread_mutex_unlock( &mutex1 );
        pthread_mutex_lock( &mutex1 );
        r = rand() % node_info[crawler[k]].num_of_nodes;
        pthread_mutex_unlock( &mutex1 );
        pthread_mutex_lock( &mutex1 );
        crawler[k] = (int)DataBlock[node_info[crawler[k]].index+r][0];
        pthread_mutex_unlock( &mutex1 );
    }
}
I need the mutexes to protect shared variables. It turns out that my parallel code is slower. But why? Is it because of the mutexes?
Could this possibly have something to do with the cache line size?
You are not parallelizing anything but the loop heads. Everything between lock and unlock is forced to be executed sequentially. And since lock/unlock are (potentially) expensive operations, the code is getting slower.
To fix this, you should at least separate expensive computations (without mutex protection) from access to shared data areas (with mutexes). Then try to move the mutexes out of the inner loop.
You could use atomic increment instructions (depends on platform) instead of plain '++', which is generally cheaper than mutexes. But beware of doing this often on data of a single cache line from different threads in parallel (see 'false sharing').
AFAICS, you could rewrite the algorithm as indicated below without needing mutexes or atomic increments at all. getFirstK() is NumOfNodes/NumOfThreads*t if NumOfNodes is an integral multiple of NumOfThreads.
for(t=0;t<NumberOfThreads;t++)
{
    kbegin = getFirstK(NumOfNodes, NumOfThreads, t);
    kend = getFirstK(NumOfNodes, NumOfThreads, t+1);
    // start the following in a separate thread with kbegin and kend
    // copied to thread local vars kbegin_ and kend_
    int k, i, r;
    unsigned state = kend_; // really bad seed
    for(k=kbegin_;k<kend_;k++)
    {
        for(i=0;i<NumberOfSteps;i++)
        {
            mark[crawler[k]]++;
            r = rand_r(&state) % node_info[crawler[k]].num_of_nodes;
            crawler[k] = (int)DataBlock[node_info[crawler[k]].index+r][0];
        }
    }
}
// wait for threads/jobs to complete
This way of generating random numbers may lead to bad random distributions; see this question for details.

OpenMP: How to get better work balance?

I'm working on a program which has to perform a computation foobar on many files. foobar can be run either in parallel or sequentially on one file. The program receives many files (which can be of different sizes!) and applies the computation foobar either in parallel or sequentially to each of them with a specified number of threads.
Here is how the program is launched on 8 files with three threads:
./program 3 file1 file2 file3 file4 file5 file6 file7 file8
The default scheduling that I've implemented is to assign, in parallel, one thread to each file to do the computation (that's how my program works now!).
Edit: Here is the default scheduling that I'm using:
#pragma omp parallel for private(i) schedule(guided,1)
for (i = 0; i < nbre_file; i++)
    foobar(files[i]); // depending on the size of files[i], foobar can behave as a sequential or a parallel program (which could induce nested loops)
See the image below.
In the image above, the final time is the time spent to run foobar sequentially on the biggest file, file8.
I think that a better scheduling, which would deal effectively with work balance, could be to apply the computation foobar to the big files in parallel, like in the image below, where tr i represents a thread,
in such a way that the final time is the time spent to run foobar in parallel on the biggest file, file8 (in the image above we have used two threads!).
My question is:
Is it possible to do such a scheduling with OpenMP?
Thanks for any reply!
Have you tried dynamic scheduling instead of guided?
If the normal scheduling clauses do not work for you, you can try to parallelize the loop by hand and assign the files to particular threads yourself. Your loop would then look like this:
#pragma omp parallel
{
    int id = omp_get_thread_num(); // declared inside the region, so it is private
    if (id == 0) { // thread 0
        for (int i = 0; i < nbre_of_small_files; i++)
            foobar(files[i]);
    }
    else { // threads 1 and 2
        for (int j = 0; j < nbre_of_big_files; j = j + 2) {
            if (id == 1) { // thread 1
                foobar(files[j]);
            }
            else { // thread 2
                foobar(files[j+1]);
            }
        }
    }
}
Here thread 0 does all the small files, and threads 1 and 2 do the big files.

Why can't the sum of "Functions With Most Individual Work" be more than 100%?

I'm using the VS2010 built-in profiler.
My application contains three threads.
One of the threads is really simple:
while (true)
{
    if (something) {
        // blah blah, very fast and rarely occurring thing
    }
    Thread.Sleep(1000);
}
Visual Studio reports that Thread.Sleep takes 36% of the program time.
The question is: why not ~100% of the time? Why does the Main method take 40% of the time? I was definitely inside this method during the application's execution, from start to end.
Does the profiler divide the result by the number of threads?
On another of my threads I've observed that a method takes 34% of the time.
What does that mean? Does it mean that it works only 34% of the time, or that it works almost all the time?
In my opinion, if I have three threads that run in parallel and I sum the method times, I should get 300% (if the application runs for 10 seconds, for example, this means that each thread runs for 10 seconds, and with 3 threads that would be 30 seconds in total).
The question is what you are measuring and how you measure it. From your question I'm actually unable to reproduce your experience...
A Thread.Sleep() call takes a very small amount of time itself. Its task is to call a native WinAPI function that tells the scheduler (responsible for dividing processor time between threads) that the user thread it was called from should not be scheduled for the next second at all. After that, this thread doesn't receive processor time until that second is over.
But the thread does not take any processor time in that state. I'm not sure how this situation is reported by the profiler.
Here is the code I was experimenting with:
internal class Program
{
    private static int x = 0;

    private static void A()
    {
        // Just to have something in the profiler here
        Console.WriteLine("A");
    }

    private static void Main(string[] args)
    {
        var t = new Thread(() => { while (x == 0) Thread.MemoryBarrier(); });
        t.Start();
        while (true)
        {
            if (DateTime.Now.Millisecond % 3 == 0)
                A();
            Thread.Sleep(1000);
        }
    }
}
