Difference between parallel map and parallel for-loop

When I read the Julia documentation on multi-core parallel computing, I noticed there are both a parallel map, pmap, and a @distributed for-loop.
From the documentation: "Julia's pmap is designed for the case where each function call does a large amount of work. In contrast, @distributed for can handle situations where each iteration is tiny".
What makes the difference between pmap and @distributed for? Why is @distributed for slow for a large amount of work?

The issue is that pmap does load balancing while @distributed for splits jobs into equal chunks. You can confirm this by running these two code examples:
julia> @time res = pmap(x -> (sleep(x/10); println(x)), [10; ones(Int, 19)]);
From worker 2: 1
From worker 3: 1
From worker 4: 1
From worker 2: 1
From worker 3: 1
From worker 4: 1
From worker 3: 1
From worker 2: 1
From worker 4: 1
From worker 4: 1
From worker 2: 1
From worker 3: 1
From worker 2: 1
From worker 3: 1
From worker 4: 1
From worker 4: 1
From worker 3: 1
From worker 2: 1
From worker 4: 1
From worker 5: 10
1.106504 seconds (173.34 k allocations: 8.711 MiB, 0.66% gc time)
julia> @time @sync @distributed for x in [10; ones(Int, 19)]
           sleep(x/10); println(x)
       end
From worker 4: 1
From worker 3: 1
From worker 5: 1
From worker 4: 1
From worker 5: 1
From worker 3: 1
From worker 5: 1
From worker 3: 1
From worker 4: 1
From worker 3: 1
From worker 4: 1
From worker 5: 1
From worker 4: 1
From worker 5: 1
From worker 3: 1
From worker 2: 10
From worker 2: 1
From worker 2: 1
From worker 2: 1
From worker 2: 1
1.543574 seconds (184.19 k allocations: 9.013 MiB)
Task (done) @0x0000000005c5c8b0
And you can see that the large job (value 10) makes pmap execute all the small jobs on workers other than the one that got the large job (in my example worker 5 did only job 10, while workers 2 to 4 did all the other jobs). On the other hand, @distributed for assigned the same number of jobs to each worker. Thus the worker that got job 10 (worker 2 in the second example) still had to do four short jobs (each worker on average has to do 5 jobs, as my example has 20 jobs in total and 4 workers).
Now the advantage of @distributed for is that if the job is inexpensive, equal splitting of jobs among workers avoids the dynamic scheduling, which is not free either.
In summary, as the documentation states, if the job is expensive (and especially if its run time can vary a lot), it is better to use pmap, as it does load balancing.

pmap has a batch_size argument which is, by default, 1. This means that each element of the collection will be sent one by one to available workers or tasks to be transformed by the function you provided. If each function call does a large amount of work, and perhaps the time each call takes varies, using pmap has the advantage of not letting workers go idle while other workers do work, because when a worker completes one transformation, it asks for the next element to transform. Therefore, pmap effectively balances the load among workers/tasks.
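For very cheap calls you can trade communication overhead against balance with this keyword. A minimal sketch (worker count and workload are illustrative):
using Distributed
addprocs(4)

# Default batch_size = 1: elements are handed out one at a time,
# giving the best load balance but the most communication.
res = pmap(x -> x^2, 1:10_000)

# A larger batch amortizes the per-message overhead for tiny calls,
# at the cost of coarser load balancing.
res = pmap(x -> x^2, 1:10_000; batch_size = 100)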
A @distributed for-loop, however, partitions a given range among workers once at the beginning, not knowing how much time each partition of the range will take. Consider, for example, a collection of matrices, where the first hundred elements of the collection are 2-by-2 matrices, the next hundred elements are 1000-by-1000 matrices, and we would like to take the inverse of each matrix using a @distributed for-loop and 2 worker processes.
# A holds the 200 matrices; B collects the inverses (B would need to be
# shared, e.g. a SharedArray, for the results to be visible on the caller).
@sync @distributed for i = 1:200
    B[i] = inv(A[i])
end
The first worker will get all the 2-by-2 matrices and the second one will get the 1000-by-1000 matrices. The first worker will complete all its transformations very quickly and go idle, while the other will continue to do work for a very long time. Although you are using 2 workers, the major part of the whole work will effectively be executed serially on the second worker, and you will get almost no benefit from using more than one worker. This problem is known as load balancing in the context of parallel computing. The problem may also arise, for example, when one processor is slow and another is fast, even if the work to be completed is homogeneous.
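For contrast, the pmap version of the same transformation hands each matrix to the next free worker, so the expensive inverses end up spread across both workers (a sketch; A is the same mixed collection as above):
# Dynamic scheduling: a worker requests the next matrix as soon as it
# finishes one, so neither worker sits idle while large inverses remain.
B = pmap(inv, A)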
For very small work transformations, however, using pmap with a small batch size creates a communication overhead that might be significant, since after each batch the worker needs to get the next batch from the calling process, whereas with a @distributed for-loop each worker process knows, at the beginning, which part of the range it is responsible for.
The choice between pmap and a @distributed for-loop depends on what you want to achieve. If you are going to transform a collection as in map, and each transformation requires a large amount of work and this amount varies, then you are likely to be better off choosing pmap. If each transformation is very tiny, then you are likely to be better off choosing a @distributed for-loop.
Note that, if you need a reduction operation after the transformation, a @distributed for-loop already provides one; most of the reduction will be applied locally on each worker, while the final reduction takes place on the calling process. With pmap, however, you will need to handle the reduction yourself.
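For instance, a sum of squares can be written either way; a minimal sketch (the workload is illustrative):
using Distributed
addprocs(2)

# @distributed with a reducer: each worker reduces its own chunk locally;
# the partial results are combined on the calling process.
total = @distributed (+) for i in 1:1000
    i^2
end

# With pmap you transform first and then do the reduction yourself.
total = sum(pmap(i -> i^2, 1:1000))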
You can also implement your own pmap function with very complex load balancing and reduction schemes if you really need one.
https://docs.julialang.org/en/v1/manual/parallel-computing/

Related

Hungarian method variant

I have some workers and tasks, and want to assign the best workers to each task; that's a typical use for the Hungarian method.
But let's add that tasks happen at a certain day/time, and I want to take into consideration that, for a given worker, I'd like his tasks to be as close in time as possible.
Is there an algorithm where I could set a priority between the two different goals?
Edit: let's try a simple example. I have 3 tasks happening respectively at 1:00, 2:00 and 3:00. I have 2 workers for my tasks. If I consider only the quality of the work, I assign worker 1 to slots 1 and 3, and worker 2 to slot 2. But then the 1st worker has a hole in his schedule, so I'd like to put a "weight" value on that so that I reach an equilibrium between the total quality and the desire to have compact work hours.

Julia - map function status

Is there a comfy way to somehow get the 'status' of map/pmap in Julia?
If I had an array a = [1:10], I'd like to either:
1: enumerate the array and use an if-conditional to add a print command
((index,value) -> 5*value ......, enumerate(a)
and where the "......." are, there would be a way to 'chain' the anonymous function to something like
"5*value and then print index/length(a) if index%200 == 0"
2: know if there is already an existing option for this, as pmap is intended for parallel tasks, which are usually used for large processes, so it would make sense for this to already exist?
Additionally, is there a way to make anonymous functions do two 'separate' things one after the other?
Example
if I have
a = [1:1000]
function f(n) # something that takes a huge amount of time
end
and I execute
map(x -> f(x), a)
the REPL would print out the status
"0.1 completed"
.
.
.
"0.9 completed"
Solution: Chris Rackauckas' answer below.
A bit odd that the ProgressMeter package doesn't include this by default.
Pkg.add("ProgressMeter")
Pkg.clone("https://github.com/slundberg/PmapProgressMeter.jl")
@everywhere using ProgressMeter
@everywhere using PmapProgressMeter
pmap(x->begin sleep(1); x end, Progress(10), 1:10)
PmapProgressMeter on github
ProgressMeter.jl has a branch for pmap.
You can also make the Juno progress bar work inside of pmap. This is kind of using undocumented things, so you should ask in the Gitter if you want more information, because posting this publicly will just confuse people if/when it changes.
You can create a function with 'state' as you ask, by implementing a 'closure'. E.g.
julia> F = function ()
           ClosedVar = 5
           return (x) -> x + ClosedVar
       end;
julia> f = F();
julia> f(5)
10
julia> ClosedVar = 1000;
julia> f(5)
10
As you can see, the function f maintains 'state' (i.e. the internal variable ClosedVar is local to F, and f maintains access to it even though F itself has technically long gone out of scope).
Note the difference with a normal, non-closed function definition:
julia> MyVar = 5;
julia> g(x) = x + MyVar;
julia> g(5)
10
julia> MyVar = 1000;
julia> g(5)
1005
You can create your own closure which interrogates / updates its closed variables when run, and does something different according to its state each time.
Having said that, from your example you seem to expect that pmap will run sequentially. This is not guaranteed. So don't rely on a 'which index is this thread processing' approach to print every 200 operations. You would probably have to maintain a closed 'counter' variable inside your closure, and rely on that. Which presumably also implies your closure needs to be accessible @everywhere.
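Such a counter closure might look like the sketch below (the names are mine; this version is single-process only - with pmap each worker would get its own copy of the counter, so the counts would be per-worker rather than global, which is what the SharedArray approach further down addresses):
# `count` is captured state that survives between calls.
function make_progress_counter(total, every)
    count = 0
    return function (x)
        count += 1
        count % every == 0 && println("$(count / total) completed")
        return x
    end
end

progress = make_progress_counter(1000, 200)
map(x -> progress(f(x)), a)   # f and a as in the question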
Why not just include it in your function's definition to print this information? E.g.
function f(n)   # something that takes a huge amount of time
    ...
    do stuff
    ...
    println("completed $n")
end
And you can add an extra argument to your function, if desired, that would contain the 0.1, ..., 0.9 from your example (I'm not quite sure what those are, but whatever they are, they can just be an argument to your function).
If you take a look at the example below on pmap and @parallel, you will find an example of a function fed to pmap that prints output.
See also this and this SO post on info for feeding multiple arguments to functions used with map and pmap.
The Julia documentation advises that
pmap() is designed for the case where each function call does a large amount of work. In contrast, @parallel for can handle situations where each iteration is tiny, perhaps merely summing two numbers.
There are several reasons for this. First, pmap incurs greater start-up costs initiating jobs on workers. Thus, if the jobs are very small, these start-up costs may become significant. Conversely, however, pmap does a "smarter" job of allocating jobs amongst workers. In particular, it builds a queue of jobs and sends a new job to each worker whenever that worker becomes available. @parallel, by contrast, divvies up all the work to be done amongst the workers when it is called. As such, if some workers take longer on their jobs than others, you can end up with a situation where most of your workers have finished and are idle while a few remain active for an inordinate amount of time, finishing their jobs. Such a situation, however, is less likely to occur with very small and simple jobs.
The following illustrates this: suppose we have two workers, one of which is slow and the other of which is twice as fast. Ideally, we would want to give the fast worker twice as much work as the slow worker. (Or, we could have fast and slow jobs, but the principle is exactly the same.) pmap will accomplish this, but @parallel won't.
For each test, we initialize the following:
addprocs(2)

@everywhere begin
    function parallel_func(idx)
        workernum = myid() - 1   # worker 2 sleeps 1s per job; worker 3 sleeps 2s
        sleep(workernum)
        println("job $idx")
    end
end
Now, for the @parallel test, we run the following:
@parallel for idx = 1:12
    parallel_func(idx)
end
And get back print output:
julia> From worker 2: job 1
From worker 3: job 7
From worker 2: job 2
From worker 2: job 3
From worker 3: job 8
From worker 2: job 4
From worker 2: job 5
From worker 3: job 9
From worker 2: job 6
From worker 3: job 10
From worker 3: job 11
From worker 3: job 12
It's almost sweet. The workers have "shared" the work evenly. Note that each worker has completed 6 jobs, even though worker 2 is twice as fast as worker 3. It may be touching, but it is inefficient.
For the pmap test, I run the following:
pmap(parallel_func, 1:12)
and get the output:
From worker 2: job 1
From worker 3: job 2
From worker 2: job 3
From worker 2: job 5
From worker 3: job 4
From worker 2: job 6
From worker 2: job 8
From worker 3: job 7
From worker 2: job 9
From worker 2: job 11
From worker 3: job 10
From worker 2: job 12
Now, note that worker 2 has performed 8 jobs and worker 3 has performed 4. This is exactly in proportion to their speed, and what we want for optimal efficiency. pmap is a hard task master - from each according to their ability.
One other possibility would be to use a SharedArray as a counter shared amongst the workers. E.g.
addprocs(2)

Counter = convert(SharedArray, zeros(Int64, nworkers()))

## Make sure each worker has the SharedArray declared on it, so that it
## need not be fed as an explicit argument
function sendto(p::Int; args...)
    for (nm, val) in args
        @spawnat(p, eval(Main, Expr(:(=), nm, val)))
    end
end

for pid in workers()
    sendto(pid, Counter = Counter)
end

@everywhere global Counter

@everywhere begin
    function do_stuff(n)
        sleep(rand())
        Counter[myid() - 1] += 1   ## each worker updates its own slot
        TotalJobs = sum(Counter)
        println("Jobs Completed = $TotalJobs")
    end
end
pmap(do_stuff, 1:10)

Task Scheduling Optimization with dependency and worker constraint

We are confronted with a task scheduling problem.
Specs
We have N workers available, and a list of tasks to do.
Each task Ti needs Di worker-days to finish (demand), and can hold no more than Ci workers working on it simultaneously (capacity).
Also, some tasks can only start after other task(s) are done (dependency).
The target is to achieve the minimal total duration by allocating workers to the tasks.
Example
Number of workers: 10
Task list: [A, B, C]
Demand: [100 50 10] - unit: worker-days (task A needs 100 worker-days to finish, B needs 50, and C needs 10)
Capacity: [10 10 2] - unit: workers (task A can hold at most 10 workers working on it at the same time, B at most 10, and C at most 2)
Dependency: {A: null, B: null, C: B} - A and B can start at any time, C can only start after B is done
Possible approaches to the example problem:
First assign B 10 workers; it will take 50/10 = 5 days to finish. Then at day 5, assign 2 workers to C and 8 workers to A; this takes max(10/2 = 5, 100/8 = 12.5) = 12.5 days to finish. The total duration is 5 + 12.5 = 17.5 days.
First assign A 10 workers; it takes 100/10 = 10 days to finish. Then at day 10, assign 10 workers to B, which takes 50/10 = 5 days to finish. Then at day 15, assign 2 workers to C, which takes 10/2 = 5 days to finish. The total duration is 10 + 5 + 5 = 20 days.
So the first practice is better, since 17.5 < 20.
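A quick check of the arithmetic for the two practices (plain calculation, no scheduling library involved):
# Practice 1: all 10 workers on B first, then 8 on A and 2 on C in parallel.
t1 = 50/10 + max(100/8, 10/2)   # 5 + 12.5 = 17.5 days

# Practice 2: A, then B, then C, each at full capacity.
t2 = 100/10 + 50/10 + 10/2      # 10 + 5 + 5 = 20 days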
But there are still many more possible allocation practices for the example problem, and we are not even sure what the best practice is to get the minimal total duration.
What we want is an algorithm:
Input: Nworker, Demand, Capacity, Dependency
Output: the worker allocation practice with the minimal total duration.
Possible allocation strategies we've considered when allocating workers to the tasks without dependencies:
Finish the tasks that other tasks depend on as soon as possible (say, finish B as soon as possible in the example).
Allocate workers to the tasks with maximum demand (say, first allocate all workers to A in the example).
But neither of the two proves to be the optimal strategy.
Any idea or suggestion would be appreciated.
This sounds like job shop scheduling with dependencies, which is NP-complete (or NP-hard), so scaling out and delivering an optimal solution in reasonable time is probably impossible.
I've gotten good results on similar cases (task assignment and dependent job scheduling) by first running a construction heuristic (pretty much one of those 2 allocation strategies you have there) and then a local search (usually Late Acceptance or Tabu Search) to get near-optimal results.
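To make the local-search half concrete, here is a minimal late-acceptance hill-climbing skeleton in Julia (my own sketch, not from the answer; evaluate, which computes the total duration of a candidate schedule, and neighbor, which makes a small random change, are placeholders you would implement for the actual problem):
# Late acceptance hill climbing: accept a candidate if it is no worse
# than the solution `history` iterations ago, or than the current one.
# This lets the search escape shallow local optima.
function late_acceptance(initial, evaluate, neighbor; history = 50, iters = 100_000)
    cur, curcost = initial, evaluate(initial)
    best, bestcost = cur, curcost
    costs = fill(curcost, history)      # late-acceptance memory
    for k in 1:iters
        cand = neighbor(cur)
        candcost = evaluate(cand)
        slot = mod1(k, history)
        if candcost <= costs[slot] || candcost <= curcost
            cur, curcost = cand, candcost
            if curcost < bestcost
                best, bestcost = cur, curcost
            end
        end
        costs[slot] = curcost           # record the cost of the current solution
    end
    return best, bestcost
end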

Understanding Machine Scheduling

I'm currently learning about priority queues and heaps in my Data Structures class, and in the class PowerPoints there is a little section that introduces machine scheduling, which I'm having difficulty understanding.
It begins by giving an example:
m identical machines
n jobs/tasks to be performed
assign jobs to machines so that the time at which the last job completes is minimum. The wording of this last part sort of throws me off... what exactly does the italicized portion mean? Can somebody word it differently?
Continuing with the example it says:
3 machines and 7 jobs
job times are [6, 2, 3, 5, 10, 7, 14]
possible schedule, shown in a picture (not reproduced here; the example schedule is constructed by scheduling the jobs in the order they appear in the given job list, left to right, with each job scheduled on the machine on which it will complete earliest).
Finish time = 21
Objective: find schedules with minimum finish time
And I don't really understand what is going on. I don't understand what is being accomplished, or how they came up with that little picture with the jobs and the different times... Can somebody help me out?
"The time at which the last job completes is minimum" = "the time at which the all jobs are finished", if that helps.
In your example, that happens at time = 21. Clearly there's no jobs still running after that time, and all jobs have been scheduled (i.e. you can't schedule no jobs and say the minimum time is time = 0).
To explain the example:
The given numbers are the durations of the jobs. The job with duration 6 is scheduled first - since scheduling it on machine A, B or C will in each case have it finish at time 6, it doesn't really matter which, so we just schedule it on machine A. Then the job with duration 2 is scheduled. Similarly it can go on B or C (if it were to go on A, it would finish at time 8, so that's not in line with our algorithm), and we schedule it on B. Then the job with duration 3 is scheduled. The respective end times for machines A, B and C would be 9, 5 and 3, so we schedule it on machine C. And so on.
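That "completes earliest" rule is exactly where the priority queue from your class comes in: keep the machines in a min-heap keyed by their current finish time, and always pop the smallest. A sketch in Julia using the DataStructures.jl package (assumed available; the function name is mine):
using DataStructures   # provides BinaryMinHeap

# Greedy list scheduling: each job, in the given order, goes to the
# machine that is free earliest (smallest current finish time).
function greedy_schedule(jobtimes, m)
    heap = BinaryMinHeap([(0, k) for k in 1:m])   # (finish time, machine id)
    assignment = zeros(Int, length(jobtimes))
    makespan = 0
    for (j, t) in enumerate(jobtimes)
        finish, machine = pop!(heap)              # machine that is free earliest
        assignment[j] = machine
        makespan = max(makespan, finish + t)
        push!(heap, (finish + t, machine))
    end
    return assignment, makespan
end

greedy_schedule([6, 2, 3, 5, 10, 7, 14], 3)   # makespan 21, matching the example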
The given algorithm is not the best we can do, though (unless perhaps there is something enforcing the order, although that wouldn't make much sense). One better assignment:
    0            14   16
A   |     14     | 2 |
    0         10      16
B   |    10    |   6  |
    0     7   10      15
C   |  7   | 3 |  5   |
Here all jobs are finished by time = 16.
I've listed the job chosen for each slot in the slot itself, to hopefully clear up any remaining confusion (for example, on machine A you can see that the jobs with durations 14 and 2 were scheduled, ending at time 16).
I'm sure the given algorithm was just an introduction to the problem, and you'll get to always producing the best result soon.
As for what's being accomplished by trying to get all jobs to finish as soon as possible: think of a computer with multiple cores, for example. There are many reasons you'd want tasks to finish as soon as possible. Perhaps you're playing a game and you have a bunch of tasks that work out what's happening (maybe there's a task assigned to each unit, or a few units, to determine what it does). You can only display after all tasks are finished, so if you don't try to finish as soon as possible, you'll unnecessarily make the game slow.

Performance problem: CPU intensive work performs better with more concurrency in Erlang

tl;dr
I'm getting better performance with my Erlang program when I perform my CPU-intensive tasks at higher concurrency (e.g. 10K at once vs. 4). Why?
I'm writing a map reduce framework in Erlang, and I'm doing performance tests.
My map function is highly CPU intensive (mostly pure calculation). It also needs access to some static data, so I have a few persistent (lingering, i.e. living through the application's life cycle) worker processes on my machine, each holding part of this data in memory and awaiting map requests. The output of map is sent to the manager process (which sent out the map requests to the workers), where the (very lightweight) reduce is performed.
Anyway, I noticed that I'm getting better throughput when I immediately spawn a new process for each map request that a worker receives, rather than letting the worker process itself perform the map requests synchronously one by one (thus leaving a bunch of map requests in its process queue, because I'm firing the map requests all at once).
Code snippet:
%% When I remove the comments, I get a significant performance boost (90% -> 96%)
%% spawn_link(fun() ->
        %% One invocation uses around 250ms of CPU time
        do_map(Map, AssignedSet, Emit, Data),
        Manager ! {finished, JobId, self(), AssignedSet, normal},
%% end),
Compared to when I perform the same calculation in a tight loop, I get 96% throughput (efficiency) using the "immediately spawn" method (e.g. 10,000 map reduce jobs running completely in parallel). When I use the "worker performs one-by-one" method, I get only around 90%.
I understand Erlang is supposed to be good at concurrent stuff, and I'm impressed that efficiency doesn't change even if I perform 10K map reduce requests at once as opposed to 100, etc.! However, since I have only 4 CPU cores, I'd expect to get better throughput if I used lower concurrency like 4 or maybe 5.
Weirdly, my CPU usage looks very similar in the two implementations (almost completely pegged at 100% on all cores). The performance difference is quite stable; i.e., even when I do just 100 map reduce jobs, I still get around 96% efficiency with the "immediately spawn" method and around 90% with the "one-by-one" method, and likewise when I test with 200, 500, 1000, or 10K jobs.
I first suspected that queuing at the worker process was the culprit, but even when I should only have something like 25 messages in the worker's process queue, I still see the lower performance. 25 messages seems quite small to cause a clog (I am doing selective message matching, but not in a way that would force the process to put messages back in the queue).
I'm not sure how I should proceed from here. Am I doing something wrong, or am I completely missing something?
UPDATE
I did some more tests and found out that the performance difference can disappear depending on conditions (particularly on how many worker processes I divide the static data among). Looks like I have much more to learn!
Assuming 1 worker process with 3 map actions, we have the first variant:
   _______     _______     _______
  |   m   |   |   m   |   |   m   |
  |       |   |       |   |       |
__|       |___|       |___|       |___
 a           a           a          r
where a is administrative tasks (reading from the message queue, dispatching the map, etc.), m is the actual map, and r is sending back the result. The second variant, where a process is spawned for every map:
    __________________
   |        m         | r
   |   __________________
   |  |        m         | r
   |  |   __________________
 __|__|__|        m         | r
  a   a  a
As you can see, administrative tasks (a) go on at the same time as maps (m) and at the same time as results are sent back (r).
This keeps the CPU busy with map (i.e. calculation-intensive) work all the time, as opposed to having short dips every now and then. This most likely explains the small gain you see in throughput.
As you have quite high concurrency from the beginning, you only see a relatively small gain in throughput. Compare this to theoretically running only one worker process (as in the first variant), where you would see much bigger gains.
First, let me remark that this is a very interesting question. I'd like to give you some hints:
Task switching occurs per run queue ([rq:x] in the shell) due to reductions: when an Erlang process calls a BIF or user-defined function, it increases its reduction counter. When running CPU-intensive code in one process, the reduction counter increases very often, and when it reaches a certain threshold, a process switch occurs. (So one process with a longer lifetime has the same overhead as multiple processes with shorter lifetimes: they have the "same" total reduction count and trigger switches at the same threshold, e.g. one process with 50,000 reductions vs. 5 processes with 10,000 reductions each = 50,000 reductions.) (Runtime reasons)
Running on 4 cores vs. 1 core makes a difference; however, timing is the difference. The reason your cores are at 100% is that one or more cores are doing the mapping while the others are effectively "filling" your message queue. When you spawn the mapping, there is less time spent "filling" the message queue and more time spent doing the mapping. Apparently, mapping is a more costly operation than filling the queue, and giving it more cores thus increases performance. (Timing/tuning reasons)
You'll get higher throughput when you increase concurrency levels if processes spend time waiting (receiving, calling OTP servers, etc.). For instance, requesting data from your static persistent workers takes some time. (Language reasons)
