Measure parallelizability of task scheduler graph in Dask - parallel-processing

If I have a large custom task graph fed into Dask, and a list of times that each task will take (e.g. in the simplest case, each task might take the same number of CPU seconds/minutes), is there a way to calculate how "parallelizable" the problem is from the structure of the graph, without actually running the tasks? The assumption would be that the time to calculate scheduling priorities etc is negligible compared to the time of each task.
I'm thinking of a metric either that plots the number of parallel processes being used over time (assuming an infinite number of available cores), or perhaps the number of hours that the whole graph would take, given a fixed number of CPU cores.
If not, this seems like a nice thing to be able to provide for the task scheduler, especially if individual tasks are expected to take many minutes or hours. However, I guess it might be hard to calculate analytically, depending on the algorithm(s) underlying task allocation in Dask.

Related

A new project management algorithm wanted for finite number of workers

I'm studying task-based parallel computing and got interested in a variation of the old project management problem -- the critical path of an activity-on-vertex (AOV) project network, which can be calculated using the topological sorting algorithm if there's no deadlock cycle. The total time of those activities on a critical path gives the minimum completion time of the project.
But this is assuming we always have enough workers simultaneously finishing the activities with no dependence on each other. If the number of workers (processors/cores) available is finite, certain activities can wait not because some activities they depend on have not yet been finished, but simply because all workers are now busy doing other activities. This is a simplified model for today's multi-core parallel computing. If there's only one worker who has to do all the activities, the project completion time is the total time of all activities. We are back to single-core serial computing that way.
Is there an efficient algorithm that gives the minimum completion time of an AOV network given a finite number of workers available? How should we wisely choose which activities to do first when the doable activities is more than the number of workers so as to minimize the idling time of workers later on? The minimum time should be somewhere in between the critical path time (infinite workers) and the total time of all activities (one worker). It should also be greater than equal to the total time divided by the number of workers (no idling). Is there an algorithm to get that minimum time?
I found a C++ conference video called "work stealing" that almost answers my question. At 18:40, the problem is said on the slide to be NP-hard if activities cannot be paused, further divided, or transferred from worker to worker. Such restrictions make decisions of which workers to finish which jobs (activities) too hard to make. Work stealing is therefore introduced to avoid making such difficult decisions beforehand. Instead, it makes such decions no longer crucial so long as certain apparent greedy rules are followed. The whole project will be always finished as soon as possible under the constraint of either the critical path or the no-idling time of the finite number of workers or both. The video then goes on talking about how to make the procedure of "work stealing" between different workers (processors) more efficient by making the implementation distributed and cache-friendly, etc.
According to the video, future C++ shared-memory parallel coding will be task-based rather than loop-based. To solve a problem, the programmer defines a bunch of tasks to finish and their dependence relations to respect, and then the coding language will automatically schedule the tasks on multiple cores at run time in a flexible way. This "event-driven"-like way of implementing a flexible code by a distributed task queuing system will become very useful in parallel computing.
When an optimization problem is NP-hard, the best way to solve it is to find ways to avoid it.

How to calculate proper timeout or ETA value

The problem is as follows: I have a service broker which distributes different types of tasks to workers. These tasks are of different size and complexity and thus require a different amount of time to process.
Now I would like to calculate a time-out value for each type of service, so that after it has elapsed the client can be notified that the task took to long, and that likely something went wrong.
There are two options which I've tried, but both have flaws.
Do not calculate the time-out, but make it an configurable setting (annoying since it's pretty arbitrary)
Start with a very generous time-out and after each task calculate some sort of running statistic (like a running average + variance). This has the problem that it converges to the mean since longer running tasks get censored. Eventually allowing a far too narrow band of task durations.
Since the tail of the distribution of task durations is likely long (i.e. longer than a normal distribution) it is likely that some more complicated statistic is needed.
Is there an accepted way of calculating time-out values based on task durations?

What's the best Task scheduling algorithm for some given tasks?

We have a list of tasks with different length, a number of cpu cores and a Context Switch time.
We want to find the best scheduling of tasks among the cores to maximize processor utilization.
How could we find this?
Isn't it like if we choose the biggest available tasks from the list and give them one by one to the current ready cores, it's going to be best or you think we must try all orders to find out which is the best?
I must add that all cores are ready at the time unit 0 and the tasks are supposed to work concurrently.
The idea here is that there's no silver bullet, for what you must consider what are the types of tasks being executed, and try to schedule them as nicely as possible.
CPU-bound tasks don't use much communication (I/O), and thus, need to be continuously executed, and interrupted only when necessary -- according to the policy being used;
I/O-bound tasks may be continuously put aside in the execution, allowing other processes to work, since it will be sleeping for many periods, waiting for data to be retrieved to primary memory;
interative tasks must be continuously executed, but needs not to be executed without interruptions, as it will generate interruptions, waiting for user inputs, but it needs to have a high priority, in order not to let the user notice delays in the execution.
Considering this, and the context switch costs, you must evaluate what types of tasks you have, choosing, thus, one or more policies for your scheduler.
Edit:
I thought this was a simply conceptual question. Considering you have to implement a solution, you must analyze the requirements.
Since you have the length of the tasks, and the context switch times, and you have to maintain the cores busy, this becomes an optimization problem, where you must keep the minimal number of cores idle when it reaches the end of the processes, but you need to maintain the minimum number of context switches, so that your overall execution time does not grow too much.
As pointed by svick, this sounds like a partition problem, which is NP-complete, and in which you need to divide a sequence of numbers into a given number of lists, so that the sum of each list is equal to each other.
In your problem you'd have a relaxation on the objective, so that you no longer need all the cores to execute the same amount of time, but you want the difference between any two cores execution time to be as small as possible.
In the reference given by svick, you can see a dynamic programming approach that you may be able to map onto your problem.

Variation of the job scheduling prob

I'm doing some administration work for an aviation transport company. They build aircraft containers and such here. One of the things they want me to code is a order optimization script that the guys on the floor can use to get the most out of the given material. To give a simple overview: say we order a certain amount beams that are 10 meters per unit. We need beam chunks of 5x 6m, 10x 3.5m, 4x 3m, which are acquired by cutting the 10m in smaller parts. What would be the minimum amount of 10m beams we need to order?
There are some parallels with the multiprocessor job scheduling problem (one beam is a processor, each chunk a job), although that focusses on minimizing the time required to perform all jobs instead of minimizing the amount of processors needed to perform all jobs within a pre-set time. The multiprocessor job scheduling problem is in NP-complete, but I wonder if my variation of the problem is too. Does anybody know similar problems and methods for solving them?
This problem is exactly: http://en.wikipedia.org/wiki/Cutting_stock_problem (more generally http://en.wikipedia.org/wiki/Bin_packing_problem). You can use any old ILP solver. I like http://lpsolve.sourceforge.net/5.5/, its quite friendly to use.

How can I determine the appropriate number of tasks with GCD or similar?

I very often encounter situations where I have a large number of small operations that I want to carry out independently. In these cases, the number of operations is so large compared to the actual time each operation takes so simply creating a task for each operation is inappropriate due to overhead, even though GCD overhead is typically low.
So what you'd want to do is split up the number of operations into nice chunks where each task operates on a chunk. But how can I determine the appropriate number of tasks/chunks?
Testing, and profiling. What makes sense, and what works well is application specific.
Basically you need to decide on two things:
The number of worker processes/threads to generate
The size of the chunks they will work on
Play with the two numbers, and calculate their throughput (tasks completed per second * number of workers). Somewhere you'll find a good equilibrium between speed, number of workers, and number of tasks in a chunk.
You can make finding the right balance even simpler by feeding your workers a bunch of test data, essentially a benchmark, and measuring their throughput automatically while adjusting these two variables. Record the throughput for each combination of worker size/task chunk size, and output it at the end. The highest throughput is your best combination.
Finally, if how long a particular task takes really depends on the task itself (e.g. some tasks take X time, and while some take X*3 time, then you can can take a couple of approaches. Depending on the nature of your incoming work, you can try one of the following:
Feed your benchmark historical data - a bunch of real-world data to be processed that represents the actual kind of work that will come into your worker grid, and measure throughput using that example data.
Generate random-sized tasks that cross the spectrum of what you think you'll see, and pick the combination that seems to work best on average, across multiple sizes of tasks
If you can read the data in a task, and the data will give you an idea of whether or not that task will take X time, or X*3 (or something in between) you can use that information before processing the tasks themselves to dynamically adjust the worker/task size to achieve the best throughput depending on current workload. This approach is taken with Amazon EC2 where customers will spin-up extra VMs when needed to handle higher load, and spin them back down when load drops, for example.
Whatever you choose, any unknown speed issue should almost always involve some kind of demo benchmarking, if the speed at which it runs is critical to the success of your application (sometimes the time to process is so small, that it's negligible).
Good luck!

Resources