The problem is as follows: I have a service broker which distributes different types of tasks to workers. These tasks are of different size and complexity and thus require a different amount of time to process.
Now I would like to calculate a time-out value for each type of service, so that after it has elapsed the client can be notified that the task took too long and that something likely went wrong.
There are two options which I've tried, but both have flaws.
Do not calculate the time-out, but make it a configurable setting (annoying, since the value is pretty arbitrary).
Start with a very generous time-out and, after each task, update some sort of running statistic (like a running average plus variance). The problem is that the estimate drifts down toward the mean, because tasks that run longer than the current time-out get censored. Eventually it allows far too narrow a band of task durations.
Since the tail of the distribution of task durations is likely long (i.e. longer than a normal distribution) it is likely that some more complicated statistic is needed.
Is there an accepted way of calculating time-out values based on task durations?
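The censoring effect in the second option can be reproduced with a small simulation. This is only a sketch under assumed conditions: durations are drawn from a log-normal distribution, the time-out is `mean + k * std` of the recorded durations, and a task that exceeds the current time-out is never recorded. The name `simulate_timeouts` and all parameter values are made up for illustration.

```python
import math
import random


def simulate_timeouts(n_tasks=5000, k=2.0, warmup=50, initial_timeout=1e9, seed=0):
    """Compare a censored running estimate of mean + k*std with the uncensored one."""
    rng = random.Random(seed)

    # Welford running statistics: [count, mean, sum of squared deviations]
    seen = [0, 0.0, 0.0]  # only tasks that finished before the time-out
    full = [0, 0.0, 0.0]  # every task, as if nothing were ever cut off

    def update(stats, x):
        stats[0] += 1
        delta = x - stats[1]
        stats[1] += delta / stats[0]
        stats[2] += delta * (x - stats[1])

    def bound(stats):
        std = math.sqrt(stats[2] / (stats[0] - 1))
        return stats[1] + k * std

    censored = 0
    for _ in range(n_tasks):
        duration = rng.lognormvariate(0.0, 1.0)  # assumed long-tailed durations
        update(full, duration)
        # Use the generous time-out until enough samples have been recorded.
        timeout = initial_timeout if seen[0] < warmup else bound(seen)
        if duration <= timeout:
            update(seen, duration)
        else:
            censored += 1  # task was cut off, so its true duration is never seen

    return bound(seen), bound(full), censored


censored_timeout, full_timeout, n_censored = simulate_timeouts()
print(censored_timeout, full_timeout, n_censored)
```

The first value (what the broker would use) comes out below the second (what the same formula gives when every duration is known), which is the narrowing band described above.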
Related
If I have a large custom task graph fed into Dask, and a list of times that each task will take (e.g. in the simplest case, each task might take the same number of CPU seconds/minutes), is there a way to calculate how "parallelizable" the problem is from the structure of the graph, without actually running the tasks? The assumption would be that the time to calculate scheduling priorities etc is negligible compared to the time of each task.
I'm thinking of a metric either that plots the number of parallel processes being used over time (assuming an infinite number of available cores), or perhaps the number of hours that the whole graph would take, given a fixed number of CPU cores.
If not, this seems like a nice thing to be able to provide for the task scheduler, especially if individual tasks are expected to take many minutes or hours. However, I guess it might be hard to calculate analytically, depending on the algorithm(s) underlying task allocation in Dask.
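For the infinite-core case, the profile can be worked out from the graph structure alone. The following is a rough sketch and not a Dask API: it assumes the graph is available as a plain dict mapping each task to the set of tasks it depends on, plus a dict of durations. The name `parallelism_profile` is made up, and scheduling overhead is taken to be zero as stated in the question.

```python
from graphlib import TopologicalSorter


def parallelism_profile(deps, durations):
    """Return (makespan, profile) assuming unlimited cores and no overhead.

    deps maps each task to the set of tasks it depends on.
    profile is a list of (start, end, n_running) intervals.
    """
    start, finish = {}, {}
    for task in TopologicalSorter(deps).static_order():
        # With unlimited cores a task starts as soon as all its inputs exist.
        start[task] = max((finish[d] for d in deps.get(task, ())), default=0.0)
        finish[task] = start[task] + durations[task]

    # Sweep over start/finish events to count how many tasks run at once.
    events = [(start[t], +1) for t in start] + [(finish[t], -1) for t in finish]
    events.sort()  # at equal times, finishes (-1) sort before starts (+1)

    profile, running, prev = [], 0, 0.0
    for time, delta in events:
        if time > prev and running > 0:
            profile.append((prev, time, running))
        running += delta
        prev = time
    return max(finish.values(), default=0.0), profile


deps = {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
durations = {"a": 1, "b": 2, "c": 3, "d": 1}
makespan, profile = parallelism_profile(deps, durations)
print(makespan, profile)
```

Dividing the total work by the makespan gives an average parallelism (7 / 5 = 1.4 for this toy graph), which is one possible single-number answer to how "parallelizable" the graph is.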
I'm studying task-based parallel computing and got interested in a variation of the old project management problem -- the critical path of an activity-on-vertex (AOV) project network, which can be calculated using the topological sorting algorithm if there's no deadlock cycle. The total time of those activities on a critical path gives the minimum completion time of the project.
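As a concrete version of the critical path calculation, here is a minimal sketch using the standard library's `graphlib`. It assumes the network is given as a dict mapping each activity to the set of activities it depends on; `critical_path` is a name chosen for illustration.

```python
from graphlib import CycleError, TopologicalSorter


def critical_path(deps, durations):
    """Return (length, path) of the longest duration-weighted path in an AOV network."""
    try:
        order = list(TopologicalSorter(deps).static_order())
    except CycleError as exc:
        raise ValueError("the network contains a dependency cycle") from exc

    finish, prev = {}, {}
    for task in order:
        # A task can start once its slowest prerequisite has finished.
        last = max(deps.get(task, ()), key=finish.__getitem__, default=None)
        finish[task] = (finish[last] if last is not None else 0) + durations[task]
        prev[task] = last

    # Walk back from the task that finishes last to recover the path.
    task = max(finish, key=finish.__getitem__)
    length, path = finish[task], []
    while task is not None:
        path.append(task)
        task = prev[task]
    return length, path[::-1]


deps = {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
durations = {"a": 1, "b": 2, "c": 3, "d": 1}
print(critical_path(deps, durations))  # prints (5, ['a', 'c', 'd'])
```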
But this is assuming we always have enough workers simultaneously finishing the activities with no dependence on each other. If the number of workers (processors/cores) available is finite, certain activities can wait not because some activities they depend on have not yet been finished, but simply because all workers are now busy doing other activities. This is a simplified model for today's multi-core parallel computing. If there's only one worker who has to do all the activities, the project completion time is the total time of all activities. We are back to single-core serial computing that way.
Is there an efficient algorithm that gives the minimum completion time of an AOV network given a finite number of available workers? How should we choose which activities to do first when there are more doable activities than workers, so as to minimize worker idling later on? The minimum time should lie somewhere between the critical path time (infinite workers) and the total time of all activities (one worker). It should also be greater than or equal to the total time divided by the number of workers (no idling). Is there an algorithm to get that minimum time?
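To make the bounds concrete, here is a sketch of a greedy, non-preemptive list scheduler. It is a heuristic, not an exact solver: whenever a worker is idle it starts the ready activity with the longest remaining path to the end of the network (a common priority rule, chosen here as an assumption). It returns the makespan it achieves together with the lower bound max(critical path, total time / workers). The name `list_schedule` and the dict-of-sets input format are made up for illustration.

```python
import heapq
from graphlib import TopologicalSorter


def list_schedule(deps, durations, n_workers):
    """Greedy non-preemptive schedule; returns (makespan, lower_bound)."""
    order = list(TopologicalSorter(deps).static_order())
    children = {t: [] for t in order}
    for task in order:
        for d in deps.get(task, ()):
            children[d].append(task)

    # Priority = longest path from the task to the end of the graph.
    rank = {}
    for task in reversed(order):
        rank[task] = durations[task] + max((rank[c] for c in children[task]), default=0)

    missing = {t: len(deps.get(t, ())) for t in order}
    ready = [(-rank[t], t) for t in order if missing[t] == 0]
    heapq.heapify(ready)
    running = []  # heap of (finish_time, task)
    now, free = 0, n_workers

    while ready or running:
        # Start as many ready tasks as there are idle workers.
        while ready and free:
            _, task = heapq.heappop(ready)
            heapq.heappush(running, (now + durations[task], task))
            free -= 1
        # Advance to the next completion and release its dependants.
        now, done = heapq.heappop(running)
        free += 1
        for child in children[done]:
            missing[child] -= 1
            if missing[child] == 0:
                heapq.heappush(ready, (-rank[child], child))

    total = sum(durations[t] for t in order)
    lower_bound = max(max(rank.values(), default=0), total / n_workers)
    return now, lower_bound


deps = {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
durations = {"a": 1, "b": 2, "c": 3, "d": 1}
print(list_schedule(deps, durations, 2))  # makespan 5, lower bound 5
print(list_schedule(deps, durations, 1))  # makespan 7, lower bound 7
```

On this toy network the greedy schedule meets the lower bound, so it is optimal there. In general the two numbers only bracket the true minimum.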
I found a C++ conference video called "work stealing" that almost answers my question. At 18:40, a slide states that the problem is NP-hard if activities cannot be paused, further divided, or transferred from worker to worker. Such restrictions make it too hard to decide in advance which worker should finish which job (activity). Work stealing is therefore introduced to avoid making such difficult decisions beforehand. Instead, it makes those decisions no longer crucial, so long as certain simple greedy rules are followed. The whole project will then always finish as soon as possible under the constraint of either the critical path, or the no-idling time of the finite number of workers, or both. The video then goes on to discuss how to make the "work stealing" procedure between different workers (processors) more efficient by making the implementation distributed, cache-friendly, etc.
According to the video, future C++ shared-memory parallel coding will be task-based rather than loop-based. To solve a problem, the programmer defines a bunch of tasks to finish and their dependence relations to respect, and then the coding language will automatically schedule the tasks on multiple cores at run time in a flexible way. This "event-driven"-like way of implementing a flexible code by a distributed task queuing system will become very useful in parallel computing.
When an optimization problem is NP-hard, the best way to solve it is to find ways to avoid it.
Suppose I run a benchmark and find the following:
With 1 concurrent user, the API handles 150 req/s (9000 req/minute).
With more than 300 concurrent users, the API starts throwing exceptions.
Each app instance makes 1 request every 30 minutes.
Is it correct to say:
the best case is that the API could handle 30 * 9000 = 270,000 users. That is, within 30 minutes there would be 270,000 sequential requests, each coming from a different user.
The worst case would be 300 users posting requests at the same time.
And if that is true, is there any way to calculate the average case?
Is it the same as calculating the worst-case and average-case complexity of an algorithm?
One theoretical tool to answer these questions is queueing theory (http://en.wikipedia.org/wiki/Queueing_theory). It says that you are very unlikely to get the level of performance that you are assuming, because the load applied to the system fluctuates, so that there are busy periods and quiet periods. If the system has nothing to do in quiet periods it is forced into idleness that you haven't accounted for. In busy periods, on the other hand, it will typically build up long queues of pending work, until the queues get so long that customers walk away, or the queues become longer than the system can support and it collapses, or both.
Figure 1 on page 3 of http://pages.cs.wisc.edu/~dsmyers/cs547/lecture_12_mm1_queue.pdf plots response time against applied load for what is probably the most optimistic even vaguely realistic situation. You can see that response time gets very large as you approach maximum load.
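To put numbers on that curve, here is a minimal sketch of the textbook M/M/1 formula for mean time in the system, 1 / (service rate - arrival rate). Treating the API as a single server with Poisson arrivals and exponential service times is an assumption made purely for illustration, as is using the 150 req/s single-user figure as the service rate.

```python
def mm1_response_time(arrival_rate, service_rate):
    """Mean time in system (waiting + service) for an M/M/1 queue, in seconds."""
    if arrival_rate >= service_rate:
        return float("inf")  # the queue grows without bound
    return 1.0 / (service_rate - arrival_rate)


service_rate = 150.0  # req/s, the single-user benchmark figure from the question
for load in (0.1, 0.5, 0.9, 0.99):
    t = mm1_response_time(load * service_rate, service_rate)
    print(f"{load:.0%} load: {t * 1000:.1f} ms")
```

Under this model the mean response time at 50% load is already twice the bare service time, and it grows without limit as the load approaches 100%.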
By far the most sensible thing to do is to run tests which apply a realistic load to your application - this is important enough for people to build things like http://jmeter.apache.org/. If you want a rule of thumb I'd say don't plan to stress the system at more than 50% of theoretical capacity as you originally calculated.
We have a list of tasks of different lengths, a number of CPU cores, and a context switch time.
We want to find the best scheduling of tasks among the cores to maximize processor utilization.
How could we find this?
Would it be optimal to repeatedly take the longest remaining task from the list and give it to the next free core, or must we try every ordering to find the best one?
I must add that all cores are ready at time 0 and the tasks are supposed to run concurrently.
The idea here is that there's no silver bullet, so you must consider what types of tasks are being executed and try to schedule them as well as possible.
CPU-bound tasks do little communication (I/O), and thus need to be executed continuously and interrupted only when necessary, according to the policy being used;
I/O-bound tasks may often be set aside during execution, allowing other processes to work, since they will be sleeping for long periods while waiting for data to be retrieved into primary memory;
interactive tasks must be executed regularly, but not necessarily without interruption, since they generate interruptions themselves while waiting for user input. They do need a high priority, so that the user does not notice delays in the execution.
Considering this, and the context switch costs, you must evaluate what types of tasks you have, choosing, thus, one or more policies for your scheduler.
Edit:
I thought this was a purely conceptual question. Considering you have to implement a solution, you must analyze the requirements.
Since you have the lengths of the tasks and the context switch time, and you have to keep the cores busy, this becomes an optimization problem: you want as few cores as possible sitting idle toward the end of the run, while also keeping the number of context switches to a minimum, so that your overall execution time does not grow too much.
As pointed out by svick, this sounds like a partition problem, which is NP-complete, and in which you need to divide a sequence of numbers into a given number of lists so that the sums of the lists are all equal.
In your problem you'd have a relaxed objective: you no longer need all the cores to run for exactly the same amount of time, but you want the difference between any two cores' execution times to be as small as possible.
In the reference given by svick, you can see a dynamic programming approach that you may be able to map onto your problem.
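On the question of whether handing out the biggest tasks first is always best: a small experiment suggests it is a good heuristic but not always optimal. The sketch below ignores the context switch time (an assumption, to keep it short) and compares the longest-task-first greedy rule with an exhaustive search on a tiny instance. Both function names are made up for illustration.

```python
import heapq
from itertools import product


def lpt_makespan(lengths, n_cores):
    """Longest-task-first greedy: give each task to the least-loaded core."""
    loads = [0] * n_cores
    heapq.heapify(loads)
    for length in sorted(lengths, reverse=True):
        heapq.heappush(loads, heapq.heappop(loads) + length)
    return max(loads)


def optimal_makespan(lengths, n_cores):
    """Try every assignment of tasks to cores; only feasible for tiny inputs."""
    best = sum(lengths)
    for assignment in product(range(n_cores), repeat=len(lengths)):
        loads = [0] * n_cores
        for core, length in zip(assignment, lengths):
            loads[core] += length
        best = min(best, max(loads))
    return best


tasks = [3, 3, 2, 2, 2]
print(lpt_makespan(tasks, 2), optimal_makespan(tasks, 2))  # prints "7 6"
```

Here the greedy rule puts the two 3s on different cores and ends at 7, while the best split is {3, 3} and {2, 2, 2}, which ends at 6. The exhaustive search grows exponentially with the number of tasks, which is why a dynamic programming approach or a heuristic is used in practice.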
I very often encounter situations where I have a large number of small operations that I want to carry out independently. In these cases, the number of operations is so large, and each operation so short, that simply creating a task for each operation is inappropriate due to overhead, even though GCD overhead is typically low.
So what you'd want to do is split up the number of operations into nice chunks where each task operates on a chunk. But how can I determine the appropriate number of tasks/chunks?
Testing, and profiling. What makes sense, and what works well is application specific.
Basically you need to decide on two things:
The number of worker processes/threads to generate
The size of the chunks they will work on
Play with the two numbers, and calculate their throughput (tasks completed per second * number of workers). Somewhere you'll find a good equilibrium between speed, number of workers, and number of tasks in a chunk.
You can make finding the right balance even simpler by feeding your workers a bunch of test data, essentially a benchmark, and measuring their throughput automatically while adjusting these two variables. Record the throughput for each combination of worker size/task chunk size, and output it at the end. The highest throughput is your best combination.
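A minimal sketch of such an automated benchmark is shown below. It is written in Python with a thread pool rather than GCD, and `process_chunk` is a hypothetical stand-in for the real per-item work, so treat it as an outline of the measurement loop rather than something to copy directly.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def process_chunk(chunk):
    """Hypothetical stand-in for the real per-item work."""
    return [x * x for x in chunk]


def benchmark(items, worker_counts, chunk_sizes, work=process_chunk):
    """Return (best, results); results maps (workers, chunk_size) to items per second."""
    results = {}
    for workers, size in product(worker_counts, chunk_sizes):
        chunks = [items[i:i + size] for i in range(0, len(items), size)]
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=workers) as pool:
            for _ in pool.map(work, chunks):
                pass  # drain results so every chunk is finished before timing stops
        elapsed = time.perf_counter() - start
        results[(workers, size)] = len(items) / elapsed
    best = max(results, key=results.get)
    return best, results


best, results = benchmark(list(range(100_000)), [1, 2, 4], [100, 1_000, 10_000])
for combo, throughput in sorted(results.items()):
    print(combo, f"{throughput:,.0f} items/s")
print("best:", best)
```

Single runs are noisy, so repeating each combination a few times and taking the median would give a steadier answer. For CPU-bound pure-Python work, swapping in `ProcessPoolExecutor` is likely to be the more representative choice.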
Finally, if how long a particular task takes really depends on the task itself (e.g. some tasks take X time, while others take X*3 time), then you can take a couple of approaches. Depending on the nature of your incoming work, you can try one of the following:
Feed your benchmark historical data - a bunch of real-world data to be processed that represents the actual kind of work that will come into your worker grid, and measure throughput using that example data.
Generate random-sized tasks that cross the spectrum of what you think you'll see, and pick the combination that seems to work best on average, across multiple sizes of tasks
If you can read the data in a task, and the data will give you an idea of whether that task will take X time or X*3 (or something in between), you can use that information before processing the tasks themselves to dynamically adjust the worker/task size and achieve the best throughput for the current workload. A similar approach is taken with Amazon EC2, for example, where customers spin up extra VMs when needed to handle higher load and spin them back down when load drops.
Whatever you choose, if the speed at which your application runs is critical to its success, any unknown speed issue should almost always be settled with some kind of benchmarking (sometimes the processing time is so small that it's negligible).
Good luck!