Julia - map function status - parallel-processing

Is there a comfy way to somehow get the 'status' of map/pmap in Julia?
If I had an array a = [1:10], I'd like to either:
1: enumerate the array and use an if-conditional to add a print command
((index,value) -> 5*value ......, enumerate(a)
and where the "......" are, there would be a way to 'chain' the anonymous function to something like
"5*value and then print index/length(a) if index%200 == 0"
2: know if there is already an existing option for this, as pmap is intended for parallel tasks, which are usually used for large jobs, so it would make sense for this to already exist.
Additionally, is there a way to make anonymous functions do two 'separate' things one after the other?
Example
if I have
a = [1:1000]
function f(n) #something that takes a huge amount of time
end
and I execute
map(x -> f(x), a)
the REPL would print out the status
"0.1 completed"
.
.
.
"0.9 completed"
Solution
Chris Rackauckas' answer
A bit odd that the ProgressMeter package doesn't include this by default.
Pkg.add("ProgressMeter")
Pkg.clone("https://github.com/slundberg/PmapProgressMeter.jl")
@everywhere using ProgressMeter
@everywhere using PmapProgressMeter
pmap(x->begin sleep(1); x end, Progress(10), 1:10)
PmapProgressMeter on github

ProgressMeter.jl has a branch for pmap.
You can also make the Juno progress bar work inside of pmap. This is kind of using undocumented things, so you should ask in the Gitter if you want more information because posting this public will just confuse people if/when it changes.
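As a hedged aside (my addition, worth checking against the current ProgressMeter README): more recent ProgressMeter releases let the @showprogress macro wrap pmap directly, so the extra package may no longer be needed:
using Distributed, ProgressMeter
addprocs(4)
# a progress bar is printed on the master process as workers finish items
@showprogress pmap(x -> (sleep(1); x), 1:10)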

You can create a function with 'state' as you ask, by implementing a 'closure'. E.g.
julia> F = function ()
ClosedVar = 5
return (x) -> x + ClosedVar
end;
julia> f = F();
julia> f(5)
10
julia> ClosedVar = 1000;
julia> f(5)
10
As you can see, the function f maintains 'state' (i.e. the internal variable ClosedVar is local to F, and f maintains access to it even though F itself has technically long gone out of scope).
Note the difference with normal, non-closed function definition:
julia> MyVar = 5;
julia> g(x) = 5 + MyVar;
julia> g(5)
10
julia> MyVar = 1000;
julia> g(5)
1005
You can create your own closure which interrogates / updates its closed variables when run, and does something different according to its state each time.
Having said that, from your example you seem to expect that pmap will run sequentially. This is not guaranteed, so don't rely on a 'which index is this thread processing' approach to print every 200 operations. You would probably have to maintain a closed 'counter' variable inside your closure and rely on that, which presumably also implies your closure needs to be accessible @everywhere.
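A minimal sketch of such a counter closure (the names with_progress and counter are mine, not from the answer):
function with_progress(f, total)
    counter = 0                       # closed-over state
    return x -> begin
        result = f(x)
        counter += 1
        counter % 200 == 0 && println("$(counter / total) completed")
        result
    end
end
# usage: map(with_progress(f, length(a)), a)
# Caveat: with pmap each worker process gets its own copy of counter,
# so this only reports per-worker progress unless the state is shared.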

Why not just include it in your function's definition to print this information? E.g.
function f(n)   # something that takes a huge amount of time
    # ... do stuff ...
    println("completed $n")
end
And, you can add an extra argument to your function, if desired, that would contain that 0.1, ... , 0.9 in your example (which I'm not quite sure what those are, but whatever they are, they can just be an argument in your function).
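For instance, a hedged sketch of that idea (f and a are the names from the question; fracs is mine): precompute the fraction for each index and pass it to map as a second collection:
fracs = (1:length(a)) ./ length(a)
map((x, frac) -> begin
        result = f(x)
        println("$frac completed")
        result
    end, a, fracs)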
If you take a look at the example below on pmap and @parallel, you will find an example of a function fed to pmap that prints output.
See also this and this SO post on info for feeding multiple arguments to functions used with map and pmap.
The Julia documentation advises that
pmap() is designed for the case where each function call does a large amount of work. In contrast, @parallel for can handle situations where each iteration is tiny, perhaps merely summing two numbers.
There are several reasons for this. First, pmap incurs greater startup costs when initiating jobs on workers, so if the jobs are very small these startup costs can become inefficient. Conversely, however, pmap does a "smarter" job of allocating jobs amongst workers. In particular, it builds a queue of jobs and sends a new job to each worker whenever that worker becomes available. @parallel, by contrast, divvies up all work to be done amongst the workers when it is called. As such, if some workers take longer on their jobs than others, you can end up with a situation where most of your workers have finished and are idle while a few remain active for an inordinate amount of time, finishing their jobs. Such a situation, however, is less likely to occur with very small and simple jobs.
The following illustrates this: suppose we have two workers, one of which is slow and the other of which is twice as fast. Ideally, we would want to give the fast worker twice as much work as the slow worker. (Or, we could have fast and slow jobs, but the principle is exactly the same.) pmap will accomplish this, but @parallel won't.
For each test, we initialize the following:
addprocs(2)
@everywhere begin
    function parallel_func(idx)
        workernum = myid() - 1
        sleep(workernum)
        println("job $idx")
    end
end
Now, for the @parallel test, we run the following:
@parallel for idx = 1:12
    parallel_func(idx)
end
And get back print output:
julia> From worker 2: job 1
From worker 3: job 7
From worker 2: job 2
From worker 2: job 3
From worker 3: job 8
From worker 2: job 4
From worker 2: job 5
From worker 3: job 9
From worker 2: job 6
From worker 3: job 10
From worker 3: job 11
From worker 3: job 12
It's almost sweet. The workers have "shared" the work evenly. Note that each worker has completed 6 jobs, even though worker 2 is twice as fast as worker 3. It may be touching, but it is inefficient.
For the pmap test, we run the following:
pmap(parallel_func, 1:12)
and get the output:
From worker 2: job 1
From worker 3: job 2
From worker 2: job 3
From worker 2: job 5
From worker 3: job 4
From worker 2: job 6
From worker 2: job 8
From worker 3: job 7
From worker 2: job 9
From worker 2: job 11
From worker 3: job 10
From worker 2: job 12
Now, note that worker 2 has performed 8 jobs and worker 3 has performed 4. This is exactly in proportion to their speed, and what we want for optimal efficiency. pmap is a hard task master - from each according to their ability.

One other possibility would be to use a SharedArray as a counter shared amongst the workers. E.g.
addprocs(2)
Counter = convert(SharedArray, zeros(Int64, nworkers()))
## Make sure each worker has the SharedArray declared on it, so that it need not be fed as an explicit argument
function sendto(p::Int; args...)
    for (nm, val) in args
        @spawnat(p, eval(Main, Expr(:(=), nm, val)))
    end
end
for (idx, pid) in enumerate(workers())
    sendto(pid, Counter = Counter)
end
@everywhere global Counter
@everywhere begin
    function do_stuff(n)
        sleep(rand())
        Counter[(myid()-1)] += 1
        TotalJobs = sum(Counter)
        println("Jobs Completed = $TotalJobs")
    end
end
pmap(do_stuff, 1:10)
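For reference, a hedged sketch of the same setup on newer Julia versions (my assumption, roughly 0.7 and later), where the parallel tools live in standard-library packages:
using Distributed, SharedArrays
addprocs(2)
@everywhere using SharedArrays
Counter = SharedArray{Int64}(nworkers())   # visible to all local workers
fill!(Counter, 0)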

Related

How to scale concurrent step function executions and avoid any maxConcurrent exceptions?

Problem: I have a Lambda which produces an array of objects which can have a length of a few thousand (worst case). Each object in this array should be processed by a step function.
I am trying to figure out what the most scalable and fault-tolerant solution is, so that every object is processed by the step function.
The complete step function does not have a long execution time (under 5 min) but has to wait in some steps for other services to continue the execution (WaitForTaskToken). The step function contains a few short-running Lambdas.
These are the possibilities I have at the moment:
1. Naive approach: In my head a few thousand or even ten thousand concurrent executions are not a big deal, so why can't I just iterate over each element and start an execution directly from the Lambda?
2. SQS: The Lambda can put each object into SQS, and another Lambda processes a batch of 10 and starts 10 step function executions. Then I could set some max concurrency on the processing Lambda to avoid too many step function executions. But there are some known issues with such an approach where messages could fail to be processed, and overall this is a lot of overhead, I think.
3. Using a Map State: I could just give the array to a Map State which runs the state machine for each object with at most 40 concurrent iterations. But what if the array is larger than 40? Can I just catch the error and retry with the objects which were not processed, in an error-catch state, until all executions have either succeeded or failed? This means if there is one failed execution I still want the other 39 executions to run.
4. Split the objects into batches and run them in parallel: Similar to 3, but instead of giving all objects to the Map State, there is another state which splits the array into groups of 40 and forwards them to the Map State, waiting until they are finished before processing the next batch. So there is one "main" state which runs for a longer time, plus 40 worker states at the same time.
All of those approaches only take the step function execution concurrency into account, but not the Lambda concurrency. Since the step functions use Lambdas, there are also a lot of concurrent Lambdas running. Could this be an issue? And if so, how can I mitigate it?
Inline Map States can handle lots of iterations, but only up to 40 concurrently. Iterations over the MaxConcurrency don't cause an error; they will be invoked with delay.
If your Step Function is only running ~40 concurrent iterations, Lambda concurrency should not be a constraint either.
I just tested a Map state with 1,000 items. Worked just fine. The Quotas page does not mention an upper limit.
In Distributed mode a Map State can handle 10,000 parallel child executions.

Running thousands of goroutines concurrently

I'm developing a simulation model for historical trading data. I have the pseudocode below where I'm using many goroutines.
start_day, _ := time.Parse("2006-01-02 15:04:05", "2015-01-01 09:15:00")
end_day, _ := time.Parse("2006-01-02 15:04:05", "2015-02-01 09:15:00")
ch := make(chan bool)
var value bool
// pair_map has int pairs like {1,2},{10,11}, approx. 20k pairs which are taken
// dynamically from a range of input
for start_day.Before(end_day) {
    // a true value is sent on ch at the end of each call
    go get_output(ch, start_day, pair_map)
    value = <-ch
    start_day = start_day.Add(time.Hour * 24)
}
Now the problem here is that each get_output() call is executed for approx. 20k combinations per day (which takes approx. 1 min). Here I'm trying to do it for a month's time span using goroutines (on a single-core machine), but it is taking almost the same amount of time as running sequentially (approx. 30 min).
Is there anything wrong with my approach, or is it because I'm using a single core?
I'm trying to do it for a month's time span using goroutines (on a single-core machine), but it is taking almost the same amount of time as running sequentially
What makes you think it should perform any better? Running multiple goroutines has its overhead: goroutines have to be managed and scheduled, and results must be communicated and gathered. If you have a single core, using multiple goroutines cannot yield a performance improvement, only detriment.
Using goroutines may give a performance boost if you have multiple cores and the tasks you distribute are "large" enough that the goroutine management overhead is smaller than the gain from parallel execution.
See related: Is golang good to use in multithreaded application?

Difference between parallel map and parallel for-loop

When I read the Julia documentation on multi-core parallel computing, I noticed there are both a parallel map, pmap, and a parallel for-loop, @distributed for.
From the documentation, "Julia's pmap is designed for the case where each function call does a large amount of work. In contrast, @distributed for can handle situations where each iteration is tiny".
What makes the difference between pmap and @distributed for? Why is @distributed for slow for a large amount of work?
Thanks
The issue is that pmap does load balancing while @distributed for splits jobs into equal chunks. You can confirm this by running these two code examples:
julia> @time res = pmap(x -> (sleep(x/10); println(x)), [10;ones(Int, 19)]);
From worker 2: 1
From worker 3: 1
From worker 4: 1
From worker 2: 1
From worker 3: 1
From worker 4: 1
From worker 3: 1
From worker 2: 1
From worker 4: 1
From worker 4: 1
From worker 2: 1
From worker 3: 1
From worker 2: 1
From worker 3: 1
From worker 4: 1
From worker 4: 1
From worker 3: 1
From worker 2: 1
From worker 4: 1
From worker 5: 10
1.106504 seconds (173.34 k allocations: 8.711 MiB, 0.66% gc time)
julia> @time @sync @distributed for x in [10;ones(Int, 19)]
           sleep(x/10); println(x)
end
From worker 4: 1
From worker 3: 1
From worker 5: 1
From worker 4: 1
From worker 5: 1
From worker 3: 1
From worker 5: 1
From worker 3: 1
From worker 4: 1
From worker 3: 1
From worker 4: 1
From worker 5: 1
From worker 4: 1
From worker 5: 1
From worker 3: 1
From worker 2: 10
From worker 2: 1
From worker 2: 1
From worker 2: 1
From worker 2: 1
1.543574 seconds (184.19 k allocations: 9.013 MiB)
Task (done) #0x0000000005c5c8b0
And you can see that the large job (value 10) makes pmap execute all the small jobs on workers different from the one that got the large job (in my example worker 5 did only job 10, while workers 2 to 4 did all the other jobs). On the other hand, @distributed for assigned the same number of jobs to each worker. Thus the worker that got job 10 (worker 2 in the second example) still had to do four short jobs (as each worker on average has to do 5 jobs; my example has 20 jobs in total and 4 workers).
Now the advantage of @distributed for is that if the job is inexpensive, then equal splitting of jobs among workers avoids having to do the dynamic scheduling, which is not free either.
In summary, as the documentation states, if the job is expensive (and especially if its run time can vary largely), it is better to use pmap as it does load-balancing.
pmap has a batch_size argument which is, by default, 1. This means that each element of the collection will be sent one by one to available workers or tasks to be transformed by the function you provided. If each function call does a large amount of work, and perhaps each call differs in the time it takes, using pmap has the advantage of not letting workers go idle while other workers do work, because when a worker completes one transformation, it will ask for the next element to transform. Therefore, pmap effectively balances load among workers/tasks.
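A hedged sketch of that knob (the function name cheap is illustrative): a larger batch_size sends work in chunks, trading scheduling granularity for less communication overhead when each call is very small:
using Distributed
addprocs(4)
@everywhere cheap(x) = x^2
res = pmap(cheap, 1:10_000; batch_size = 100)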
A @distributed for-loop, however, partitions a given range among workers once at the beginning, not knowing how much time each partition of the range will take. Consider, for example, a collection of matrices, where the first hundred elements of the collection are 2-by-2 matrices, the next hundred elements are 1000-by-1000 matrices, and we would like to take the inverse of each matrix using @distributed for-loops and 2 worker processes.
@sync @distributed for i = 1:200
    B[i] = inv(A[i])
end
The first worker will get all the 2-by-2 matrices and the second one will get the 1000-by-1000 matrices. The first worker will complete all its transformations very quickly and go idle, while the other will continue to do work for a very long time. Although you are using 2 workers, the major part of the whole work will effectively be executed in serial on the second worker, and you will get almost no benefit from using more than one worker. This problem is known as load balancing in the context of parallel computing. The problem may also arise, for example, when one processor is slow and the other is fast, even if the work to be completed is homogeneous.
For very small work transformations, however, using pmap with a small batch size creates a communication overhead that might be significant, since after each batch the processor needs to get the next batch from the calling process, whereas with @distributed for-loops each worker process will know, at the beginning, which part of the range it is responsible for.
The choice between pmap and a @distributed for-loop depends on what you want to achieve. If you are going to transform a collection as in map, and each transformation requires a large amount of work and this amount is varying, then you are likely to be better off choosing pmap. If each transformation is very tiny, then you are likely to be better off choosing a @distributed for-loop.
Note that, if you need a reduction operation after the transformation, the @distributed for-loop already provides one: most of the reduction will be applied locally, while the final reduction will take place on the calling process. With pmap, however, you will need to handle the reduction yourself.
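A short sketch of that built-in reduction: each worker sums its chunk locally, and (+) combines the partial results on the calling process:
using Distributed
addprocs(2)
total = @distributed (+) for i in 1:1_000
    i^2
end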
You can also implement your own pmap function with very complex load balancing and reduction schemes if you really need one.
https://docs.julialang.org/en/v1/manual/parallel-computing/

Task Scheduling Optimization with dependency and worker constraint

We are confronted with a task scheduling problem.
Specs
We have N workers available, and a list of tasks to do.
Each task Ti needs Di worker-days to finish (Demand), and can hold no more than Ci workers working on it simultaneously (Capacity).
Some tasks can only start after other task(s) are done (Dependency).
The target is to achieve the minimal total duration by allocating workers to the tasks.
Example
Number of workers: 10
Task List: [A, B, C]
Demand: [100 50 10] - unit: worker-days (Task A needs 100 worker-days to finish, B needs 50 worker-days, and C needs 10 worker-days)
Capacity: [10 10 2] - unit: workers (Task A can only hold 10 workers working on it at the same time, B can only hold 10, and C can only hold 2)
Dependency: {A: null, B: null, C: B} - A and B can start at any time, C can only start after B is done
Possible approaches to the example problem:
First assign B 10 workers, and it will take 50/10 = 5 days to finish. Then at day 5, we assign 2 workers to C, and 8 workers to A, it will take max(10/2 = 5, 100/8 = 12.5) = 12.5 days to finish. Then the total duration is 5 + 12.5 = 17.5 days.
First assign A 10 workers, and it takes 100/10 = 10 days to finish. Then at day 10, we assign 10 workers to B, which takes 50/10 = 5 days to finish. Then at day 15, we assign 2 workers to C, which takes 10/2 = 5 days to finish. The total duration is 10+5+5 = 20 days.
So the first practice is better, since 17.5 < 20.
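As a quick sanity check of that arithmetic (a toy calculation, not a general solver; a task with remaining demand D and w assigned workers takes D/w days):
plan1 = 50/10 + max(100/8, 10/2)   # B first, then A and C in parallel -> 17.5
plan2 = 100/10 + 50/10 + 10/2      # A, then B, then C                 -> 20.0
println((plan1, plan2))            # (17.5, 20.0)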
But there are still many more possible allocation practices to the example problem, and we are not even sure about what is the best practice to get the minimal total duration for it.
What we want is an algorithm:
Input:
Nworker, Demand, Capacity, Dependency
Output: a worker allocation with the minimal total duration.
Possible Allocation Strategies we've considered when allocating for the tasks without dependency:
First finish the tasks that other tasks depend on as soon as possible (say, finish B as soon as possible in the example)
Allocate workers to the tasks with maximum demand first (say, first allocate all workers to A in the example)
But neither of the two proves to be the optimal strategy.
Any idea or suggestion would be appreciated. Thanks !
This sounds like Job Shop Scheduling with dependencies, which is NP-complete (or NP-hard). So scaling out and delivering an optimal solution in reasonable time is probably impossible.
I've gotten good results on similar cases (task assigning and dependent job scheduling) by first doing a Construction Heuristic (pretty much one of those 2 allocation strategies you have there) and then doing a Local Search (usually Late Acceptance or Tabu Search) to get near-optimal results.

Understanding Machine Scheduling

I'm currently learning about priority queues and heaps in my Data Structures class and all that stuff and in the class power points there is a little section that introduces machine scheduling, and I'm having difficulty understanding what is going on.
It begins by giving an example:
m identical machines
n jobs/tasks to be performed
assign jobs to machines so that the time at which the last job completes is minimum. --> The wording of this last part sort of throws me off... what exactly does the italicized portion mean? Can somebody word it differently?
Continuing with the example it says:
3 machines and 7 jobs
job times are [6, 2, 3, 5, 10, 7, 14]
possible schedule, followed by this picture:
(Example schedule is constructed by scheduling the jobs in the order they appear in the given job list (left to right); each job is scheduled on the machine on which it will complete earliest.
)
Finish time = 21
Objective: find schedules with minimum finish time
And I don't really understand what is going on. I don't understand what is being accomplished, or how they came up with that little picture with the jobs and the different times... Can somebody help me out?
"The time at which the last job completes is minimum" = "the time at which all jobs are finished", if that helps.
In your example, that happens at time = 21. Clearly there's no jobs still running after that time, and all jobs have been scheduled (i.e. you can't schedule no jobs and say the minimum time is time = 0).
To explain the example:
The given jobs are the duration of the jobs. The job with duration 6 is scheduled first - since scheduling it on machines A, B or C will all end up with it finishing at time 6, which one doesn't really matter, so we just schedule it on machine A. Then the job with duration 2 is scheduled. Similarly it can go on B or C (if it were to go on A, it would finish at time 8, so that's not in line with our algorithm), and we schedule it on B. Then the job with duration 3 is scheduled. The respective end times for machines A, B and C would be 9, 5 and 3, so we schedule it on machine C. And so on.
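A sketch of that greedy rule in code (my reconstruction, not from the slides): each job, taken in the given order, goes to the machine that currently frees up first:
jobs = [6, 2, 3, 5, 10, 7, 14]
finish = zeros(Int, 3)        # current finish times of machines A, B, C
for d in jobs
    m = argmin(finish)        # machine on which this job completes earliest
    finish[m] += d
end
println(finish)               # [13, 21, 13]; makespan = maximum(finish) = 21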
The given algorithm is not the best we can do, though (perhaps there is something enforcing the order, although that wouldn't make much sense). One better assignment:
Machine A: 14, then 2 (jobs end at times 14 and 16)
Machine B: 10, then 6 (jobs end at times 10 and 16)
Machine C: 7, then 3, then 5 (jobs end at times 7, 10 and 15)
Here all jobs are finished at time = 16.
I've listed the actual job chosen for each machine to hopefully explain it better and clear up any remaining confusion (for example, on machine A, you can see that the jobs with durations 14 and 2 were scheduled, ending at time 16).
I'm sure the given algorithm was just an introduction to the problem and you'll get to always producing the best result soon.
What's being accomplished by trying to get all jobs to finish as soon as possible: think of a computer with multiple cores, for example. There are many reasons you'd want tasks to finish as soon as possible. Perhaps you're playing a game and you have a bunch of tasks that work out what's happening (maybe there's a task assigned to each unit / a few units to determine what it does). You can only display after all tasks are finished, so if you don't try to finish as soon as possible, you'll unnecessarily make the game slow.
