We have a logical queue of tasks, where each task has to be assigned to multiple workers.
The number of workers to assign is based on a configured minimum and maximum.
A worker should never see a task they have already completed. It is not necessary that all workers see all the tasks.
The total number of workers can change dynamically, and each worker can come online or go offline at any time.
Each worker can choose to either complete the task or let it expire.
On expiry the task should be assigned to any worker who has not already completed the task.
Is there a good algorithm to solve this scenario?
The easy solution:
Greedily assign tasks:
For every task that is ready:
    find Minimum <= N <= Maximum workers that haven't seen the task yet and assign them.
Repeat until you either run out of workers or finish all the tasks.
If a worker comes online or finishes a task, re-check all the tasks.
If a new task comes in, re-check the available workers.
This might be enough if there are not that many tasks, but it is computationally heavy and recomputes everything from scratch.
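As a rough sketch of that loop (the Task and Worker interfaces and their methods, minWorkers, maxWorkers, hasCompleted and start, are hypothetical names invented for illustration, not part of the question):

    import java.util.ArrayList;
    import java.util.List;

    interface Worker { boolean hasCompleted(Task t); }

    interface Task {
        int minWorkers();
        int maxWorkers();
        void start(List<Worker> workers);
    }

    class GreedyAssigner {
        // Returns the workers chosen for one task, or an empty list if the
        // minimum could not be met with the workers currently available.
        static List<Worker> pick(Task task, List<Worker> onlineIdleWorkers) {
            List<Worker> chosen = new ArrayList<>();
            for (Worker w : onlineIdleWorkers) {
                if (chosen.size() == task.maxWorkers()) break;
                if (!w.hasCompleted(task)) chosen.add(w);
            }
            if (chosen.size() < task.minWorkers()) return List.of(); // not enough yet
            return chosen;
        }

        static void schedule(List<Task> readyTasks, List<Worker> onlineIdleWorkers) {
            for (Task t : readyTasks) {
                List<Worker> chosen = pick(t, onlineIdleWorkers);
                if (!chosen.isEmpty()) {
                    onlineIdleWorkers.removeAll(chosen); // these workers are now busy
                    t.start(chosen);
                }
                if (onlineIdleWorkers.isEmpty()) break;  // ran out of workers
            }
        }
    }

schedule() would be re-run whenever a worker comes online, a worker finishes, or a new task arrives, which is exactly the recomputation the paragraph above warns about.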
The possible optimizations:
If the greedy solution fails (and it probably will), there are ways to improve on it. I will try to list those that come to mind, but it won't be an exhaustive list.
First, my personal favorite: network flows. Unfortunately, I don't see an easy way to enforce the minimum-workers requirement; however, it would be fast and it would assign the maximum possible number of workers at any given moment.
Create the network Source - Workers - Tasks - Sink. Edges from workers to tasks are linked and unlinked as needed:
when a worker is available for a task, create the edge with capacity 1; otherwise don't create the edge.
From the source, link an edge with capacity one to each online worker.
From every task, link an edge to the sink with capacity equal to its maximum worker capacity.
You could even differentiate between different kinds of workers; network flows are awesome. The algorithms are fast, which makes them suitable even for large graphs, and they are available in many libraries, so you won't have to implement them yourself. Unfortunately, there is no easy way to enforce the minimum-workers rule. At least I don't see one right now; there might be some way, or maybe at least a heuristic.
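As an illustration of how that network could be built and solved (the node layout, the capacity matrix, and the small textbook Edmonds-Karp solver below are my own sketch, not something from the answer):

    import java.util.ArrayDeque;
    import java.util.Arrays;
    import java.util.Queue;

    class WorkerTaskFlow {
        // Node layout: 0 = source, 1..W = workers, W+1..W+T = tasks, W+T+1 = sink.
        static int[][] buildNetwork(int workers, int tasks,
                                    boolean[][] eligible, int[] maxPerTask) {
            int n = workers + tasks + 2;
            int sink = n - 1;
            int[][] cap = new int[n][n];
            for (int w = 0; w < workers; w++) {
                cap[0][1 + w] = 1;                          // source -> worker, capacity 1
                for (int t = 0; t < tasks; t++) {
                    if (eligible[w][t]) {
                        cap[1 + w][1 + workers + t] = 1;    // worker -> task they may do
                    }
                }
            }
            for (int t = 0; t < tasks; t++) {
                cap[1 + workers + t][sink] = maxPerTask[t]; // task -> sink, max workers
            }
            return cap;
        }

        // Plain Edmonds-Karp max flow on a dense capacity matrix.
        static int maxFlow(int[][] cap, int s, int t) {
            int n = cap.length;
            int flow = 0;
            while (true) {
                int[] parent = new int[n];
                Arrays.fill(parent, -1);
                parent[s] = s;
                Queue<Integer> queue = new ArrayDeque<>();
                queue.add(s);
                while (!queue.isEmpty() && parent[t] == -1) {
                    int u = queue.poll();
                    for (int v = 0; v < n; v++) {
                        if (parent[v] == -1 && cap[u][v] > 0) {
                            parent[v] = u;
                            queue.add(v);
                        }
                    }
                }
                if (parent[t] == -1) return flow;           // no augmenting path left
                int bottleneck = Integer.MAX_VALUE;
                for (int v = t; v != s; v = parent[v]) {
                    bottleneck = Math.min(bottleneck, cap[parent[v]][v]);
                }
                for (int v = t; v != s; v = parent[v]) {
                    cap[parent[v]][v] -= bottleneck;        // consume forward capacity
                    cap[v][parent[v]] += bottleneck;        // add residual capacity
                }
                flow += bottleneck;
            }
        }
    }

After maxFlow has run, worker w ends up assigned to task t exactly when the residual capacity cap[1 + workers + t][1 + w] is positive, i.e. a unit of flow went through that worker-task edge. As the answer notes, this maximizes the total number of assignments but does not enforce the per-task minimum.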
Second, be smart while being greedy:
Create a queue for every task.
When a worker is available, register them in the queue of every task they can do.
When a worker is not available, remove them from the queues.
When a task has enough workers, start it and mark those workers as busy.
This is still the brute-force approach; however, since you keep the queues, you limit the amount of necessary computation to a reasonable level. The potential downside is that large tasks (with a large minimum number of workers) might be stalled by the small tasks that are easier to start and will eat up the workers, so some further checking, balancing, and prioritizing will probably be needed.
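A rough single-threaded sketch of that bookkeeping (the class, method, and field names are mine, and trimming the chosen set down to the maximum-workers cap is omitted for brevity):

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.Deque;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    class QueuedScheduler {
        // One waiting queue of candidate worker ids per task id.
        private final Map<String, Deque<String>> waiting = new HashMap<>();
        private final Map<String, Integer> minWorkers = new HashMap<>();

        void addTask(String taskId, int min) {
            minWorkers.put(taskId, min);
            waiting.put(taskId, new ArrayDeque<>());
        }

        // Called when a worker comes online or becomes idle.
        void workerAvailable(String workerId, List<String> eligibleTaskIds) {
            for (String taskId : eligibleTaskIds) {
                Deque<String> q = waiting.get(taskId);
                if (q == null) continue;            // task already started or unknown
                q.addLast(workerId);
                if (maybeStart(taskId)) return;     // this worker was consumed by the task
            }
        }

        // Called when a worker goes offline or becomes busy.
        void workerUnavailable(String workerId) {
            for (Deque<String> q : waiting.values()) q.remove(workerId);
        }

        // Returns true if the task had enough workers and was started.
        private boolean maybeStart(String taskId) {
            Deque<String> q = waiting.get(taskId);
            if (q == null || q.size() < minWorkers.get(taskId)) return false;
            List<String> chosen = new ArrayList<>(q);
            waiting.remove(taskId);                       // task leaves the waiting state
            for (String w : chosen) workerUnavailable(w); // chosen workers leave the other queues
            startTask(taskId, chosen);
            return true;
        }

        void startTask(String taskId, List<String> workers) {
            // Placeholder: the actual dispatch to workers would go here.
            System.out.println("Starting " + taskId + " with " + workers);
        }
    }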
There is definitely more that could be tried for the task at hand; however, the information you provided is rather limited, so this advice is not that specific.
In CPU scheduling, is there any situation in which first-in-first-out (FIFO) can be faster than non-preemptive shortest-job-first (SJF)?
Faster doesn't always make sense here.
If you sum up the execution time for a bunch of jobs, it comes out the same either way at the end (on a single thread, or on a single core ignoring context-switching overhead).
For multiple threads and/or processes, FIFOs are faster when the jobs are very small, as there is less overhead involved in queuing and dequeuing. When the scheduling algorithm itself dominates the CPU time, FIFO is probably faster.
For overhead reasons, FIFOs also perform better when you can add items to the queue in an order that is statistically similar to SJF, meaning you can approximate what SJF would do by the order in which you add things to the queue.
NOTE: I don't have sources for this, it simply makes logical sense.
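As a tiny illustration of that last point (a made-up example, not from the answer): when jobs happen to arrive roughly shortest-first, FIFO's order is already SJF's order, so there is nothing left for SJF to improve.

    class WaitingTimeDemo {
        // Average waiting time for non-preemptive scheduling of jobs that all
        // arrive at time 0 and are served in the given order.
        static double averageWait(int[] burstsInServiceOrder) {
            int elapsed = 0, totalWait = 0;
            for (int burst : burstsInServiceOrder) {
                totalWait += elapsed;   // this job waited for everything before it
                elapsed += burst;
            }
            return (double) totalWait / burstsInServiceOrder.length;
        }

        public static void main(String[] args) {
            // Arrival order already shortest-first: FIFO == SJF, average wait 8/3.
            System.out.println(averageWait(new int[] {2, 4, 6}));
            // Arrival order longest-first: FIFO averages 16/3, while SJF would
            // reorder to {2, 4, 6} and get 8/3 again.
            System.out.println(averageWait(new int[] {6, 4, 2}));
        }
    }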
In Multilevel Feedback Scheduling, the processes in the base-level queue circulate in round-robin fashion until they complete and leave the system. Processes in the base-level queue can also be scheduled on a first-come-first-served basis.
Why can't they be scheduled with the Shortest Job First (SJF) algorithm instead of First Come First Served (FCFS), which seems like it would improve the average performance of the algorithm?
One simple reason:
The processes fall into the base-level queue after they fail to finish within the time quantum allotted to them in the higher-level queues. If you implement the SJF algorithm in the base-level queue, you may starve a process, because shorter jobs may keep arriving before a longer-running process ever gets the CPU.
The SJF algorithm gives more throughput only when processes differ a lot in their burst times. However, it is not always the case that it will perform better than FCFS. Take a look at this answer.
Since, in the Multilevel Feedback Scheduling algorithm, all the processes that are unable to complete execution within the defined time quantum of the first two queues are put into the last queue, which uses FCFS, it is very likely that they all have large CPU bursts and therefore won't differ much in their burst times. Hence, FCFS scheduling is preferred for the last queue.
We have a list of tasks of different lengths, a number of CPU cores, and a context-switch time.
We want to find the best scheduling of tasks among the cores to maximize processor utilization.
How could we find this?
Isn't it the case that if we choose the biggest available tasks from the list and give them one by one to whichever cores are currently ready, the result will be the best, or do you think we must try all orders to find out which is best?
I should add that all cores are ready at time unit 0 and the tasks are supposed to run concurrently.
The idea here is that there is no silver bullet; you must consider what types of tasks are being executed and try to schedule them as nicely as possible.
CPU-bound tasks don't do much communication (I/O), and thus need to be executed continuously, interrupted only when necessary, according to the policy being used;
I/O-bound tasks may repeatedly be set aside during execution, allowing other processes to work, since they will be sleeping for long periods waiting for data to be retrieved into primary memory;
interactive tasks must be serviced continually, but need not run without interruption, since they generate interruptions while waiting for user input; however, they need a high priority so that the user does not notice delays in execution.
Considering this, and the context-switch costs, you must evaluate what types of tasks you have and then choose one or more policies for your scheduler.
Edit:
I thought this was simply a conceptual question. Considering you have to implement a solution, you must analyze the requirements.
Since you have the lengths of the tasks and the context-switch times, and you have to keep the cores busy, this becomes an optimization problem: you want to minimize the number of idle cores toward the end of the run, while also keeping the number of context switches to a minimum so that your overall execution time does not grow too much.
As pointed out by svick, this sounds like the partition problem, which is NP-complete, and in which you need to divide a sequence of numbers into a given number of lists so that the sums of the lists are equal.
In your problem the objective is relaxed: you no longer need all the cores to execute for exactly the same amount of time, but you want the difference between any two cores' execution times to be as small as possible.
In the reference given by svick, you can see a dynamic programming approach that you may be able to map onto your problem.
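For a concrete starting point, the "biggest task first" idea in the question is essentially the classic longest-processing-time (LPT) heuristic: sort tasks by decreasing length and always hand the next one to the least-loaded core. It is only an approximation, not the optimum (the underlying problem is NP-complete, as noted above). A sketch, where charging one context switch per task placed on a core is my own simplifying assumption:

    import java.util.Arrays;
    import java.util.Comparator;
    import java.util.PriorityQueue;

    class LptScheduler {
        // Returns the finishing time of each core under the LPT heuristic.
        static long[] schedule(long[] taskLengths, int cores, long contextSwitch) {
            long[] sorted = taskLengths.clone();
            Arrays.sort(sorted);                            // ascending
            long[] load = new long[cores];
            // Min-heap of core indices ordered by their current load.
            PriorityQueue<Integer> byLoad =
                new PriorityQueue<>(Comparator.comparingLong((Integer c) -> load[c]));
            for (int c = 0; c < cores; c++) byLoad.add(c);
            for (int i = sorted.length - 1; i >= 0; i--) {  // longest task first
                int core = byLoad.poll();                   // least-loaded core
                load[core] += sorted[i] + contextSwitch;    // one switch per placed task (assumption)
                byLoad.add(core);                           // re-insert with its new load
            }
            return load;
        }

        public static void main(String[] args) {
            long[] finish = schedule(new long[] {7, 5, 4, 3, 3, 2}, 2, 1);
            System.out.println(Arrays.toString(finish));    // prints [15, 15]: the cores finish together
        }
    }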
UPDATE: Here's my implementation of Hashed Timing Wheels. Please let me know if you have an idea to improve the performance and concurrency. (20-Jan-2009)
// Sample usage:
public static void main(String[] args) throws Exception {
    Timer timer = new HashedWheelTimer();
    for (int i = 0; i < 100000; i++) {
        timer.newTimeout(new TimerTask() {
            public void run(Timeout timeout) throws Exception {
                // Extend another second.
                timeout.extend();
            }
        }, 1000, TimeUnit.MILLISECONDS);
    }
}
UPDATE: I solved this problem by using Hierarchical and Hashed Timing Wheels. (19-Jan-2009)
I'm trying to implement a special-purpose timer in Java which is optimized for timeout handling. For example, a user can register a task with a deadline, and the timer can notify the user's callback method when the deadline passes. In most cases, a registered task will be done within a very short amount of time, so most tasks will be canceled (e.g. task.cancel()) or rescheduled to the future (e.g. task.rescheduleToLater(1, TimeUnit.SECONDS)).
I want to use this timer to detect idle socket connections (e.g. close the connection when no message is received in 10 seconds) and write timeouts (e.g. raise an exception when a write operation is not finished in 30 seconds). In most cases, the timeout will not occur: the client will send a message and the response will be sent, unless there's a weird network issue.
I can't use java.util.Timer or java.util.concurrent.ScheduledThreadPoolExecutor because they assume most tasks are supposed to be timed out. If a task is cancelled, the cancelled task is stored in its internal heap until ScheduledThreadPoolExecutor.purge() is called, and it's a very expensive operation. (O(NlogN) perhaps?)
In the traditional heaps and priority queues I learned about in my CS classes, updating the priority of an element is an expensive operation (O(log N) in many cases), because it can only be achieved by removing the element and re-inserting it with a new priority value. Some heaps, like the Fibonacci heap, have O(1) decreaseKey() and min() operations, but what I need at least is a fast increaseKey() and min() (or decreaseKey() and max()).
Do you know of any data structure which is highly optimized for this particular use case? One strategy I'm considering is just storing all the tasks in a hash table and iterating over them every second or so, but it's not that beautiful.
How about trying to separate the handling of the normal case, where things complete quickly, from the error cases?
Use both a hash table and a priority queue. When a task is started it gets put in the hash table and if it finishes quickly it gets removed in O(1) time.
Every second, you scan the hash table, and any tasks that have been there a long time, say 0.75 seconds, get moved to the priority queue. The priority queue should always be small and easy to handle. This assumes that one second is much less than the timeout times you are looking for.
If scanning the hash table is too slow, you could use two hash tables, essentially one for even-numbered seconds and one for odd-numbered seconds. When a task gets started it is put in the current hash table. Every second move all the tasks from the non-current hash table into the priority queue and swap the hash tables so that the current hash table is now empty and the non-current table contains the tasks started between one and two seconds ago.
These options are a little more complicated than just using a priority queue, but they are easy to implement and should be stable.
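A minimal single-threaded sketch of that scheme (the names, the one-second tick, and the done flag used for lazy cancellation are my own choices, and a real version would need synchronization):

    import java.util.HashMap;
    import java.util.Map;
    import java.util.PriorityQueue;

    class TwoTableTimeouts {
        static final class Entry {
            final long id;
            final long deadlineMillis;
            final Runnable onTimeout;
            boolean done;                      // set when the task completes or is cancelled
            Entry(long id, long deadlineMillis, Runnable onTimeout) {
                this.id = id;
                this.deadlineMillis = deadlineMillis;
                this.onTimeout = onTimeout;
            }
        }

        private final Map<Long, Entry> byId = new HashMap<>();  // every live task
        private Map<Long, Entry> current = new HashMap<>();     // started less than ~1 s ago
        private Map<Long, Entry> previous = new HashMap<>();    // started 1-2 s ago
        private final PriorityQueue<Entry> longRunning =
            new PriorityQueue<>((a, b) -> Long.compare(a.deadlineMillis, b.deadlineMillis));

        // O(1): new tasks go into the current table.
        void start(long id, long timeoutMillis, Runnable onTimeout) {
            Entry e = new Entry(id, System.currentTimeMillis() + timeoutMillis, onTimeout);
            byId.put(id, e);
            current.put(id, e);
        }

        // O(1) in the common case where the task finishes quickly; entries that
        // already migrated to the heap are skipped lazily when they expire.
        void finish(long id) {
            Entry e = byId.remove(id);
            if (e != null) {
                e.done = true;
                current.remove(id);
                previous.remove(id);
            }
        }

        // Called roughly once per second: migrate survivors, then fire expired timeouts.
        void tick() {
            for (Entry e : previous.values()) {
                if (!e.done) longRunning.add(e);
            }
            previous = current;
            current = new HashMap<>();
            long now = System.currentTimeMillis();
            while (!longRunning.isEmpty() && longRunning.peek().deadlineMillis <= now) {
                Entry e = longRunning.poll();
                byId.remove(e.id);
                if (!e.done) e.onTimeout.run();
            }
        }
    }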
To the best of my knowledge (I wrote a paper about a new priority queue, which also reviewed past results), no priority queue implementation gets the bounds of Fibonacci heaps, as well as constant-time increase-key.
There is a small problem with getting that literally. If you could get increase-key in O(1), then you could get delete in O(1): just increase the key to +infinity (you can handle the queue being full of lots of +infinities using some standard amortization tricks). But if find-min is also O(1), that means delete-min = find-min + delete becomes O(1). That's impossible in a comparison-based priority queue, because the sorting lower bound (insert everything, then remove one by one) implies that
n * insert + n * delete-min = Omega(n log n).
The point here is that if you want a priority-queue to support increase-key in O(1), then you must accept one of the following penalties:
Not be comparison based. Actually, this is a pretty good way to get around things, e.g. vEB trees.
Accept O(log n) for inserts and also O(n log n) for make-heap (given n starting values). This sucks.
Accept O(log n) for find-min. This is entirely acceptable if you never actually do find-min (without an accompanying delete).
But, again, to the best of my knowledge, no one has done the last option. I've always seen it as an opportunity for new results in a pretty basic area of data structures.
Use a Hashed Timing Wheel - Google 'Hashed Hierarchical Timing Wheels' for more information. It's a generalization of the answers people have given here. I'd prefer a hashed timing wheel with a large wheel size over hierarchical timing wheels.
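For reference, a stripped-down single-threaded hashed timing wheel looks roughly like this (the bucket representation, tick length, and lazy cancellation are arbitrary choices of mine; a production version adds concurrency, a dedicated tick thread, and possibly hierarchy):

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;

    class SimpleHashedWheel {
        static final class Timeout {
            final Runnable task;
            long remainingRounds;        // full wheel revolutions left before expiry
            boolean cancelled;
            Timeout(Runnable task, long remainingRounds) {
                this.task = task;
                this.remainingRounds = remainingRounds;
            }
            void cancel() { cancelled = true; }   // O(1); removed lazily on its tick
        }

        private final List<List<Timeout>> buckets;
        private final long tickMillis;
        private long currentTick;

        SimpleHashedWheel(int wheelSize, long tickMillis) {
            this.buckets = new ArrayList<>(wheelSize);
            for (int i = 0; i < wheelSize; i++) buckets.add(new ArrayList<>());
            this.tickMillis = tickMillis;
        }

        // O(1): hash the expiry tick into a bucket, remember how many laps remain.
        Timeout schedule(Runnable task, long delayMillis) {
            long ticks = Math.max(1, delayMillis / tickMillis);
            long expiryTick = currentTick + ticks;
            Timeout t = new Timeout(task, ticks / buckets.size());
            buckets.get((int) (expiryTick % buckets.size())).add(t);
            return t;
        }

        // Call once per tickMillis, e.g. from a dedicated thread.
        void tick() {
            List<Timeout> bucket = buckets.get((int) (currentTick % buckets.size()));
            for (Iterator<Timeout> it = bucket.iterator(); it.hasNext(); ) {
                Timeout t = it.next();
                if (t.cancelled) { it.remove(); continue; }
                if (t.remainingRounds > 0) { t.remainingRounds--; continue; }
                it.remove();
                t.task.run();            // deadline reached on this revolution
            }
            currentTick++;
        }
    }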
Some combination of hashes and O(logN) structures should do what you ask.
I'm tempted to quibble with the way you're analyzing the problem. In your comment above, you say
Because the update will occur very very frequently. Let's say we are sending M messages per connection then the overall time becomes O(MNlogN), which is pretty big. – Trustin Lee (6 hours ago)
which is absolutely correct as far as it goes. But most people I know would concentrate on the cost per message, on the theory that as your app has more and more work to do, it is obviously going to require more resources.
So if your application has a billion sockets open simultaneously (is that really likely?) the insertion cost is only about 60 comparisons per message.
I'll bet money that this is premature optimization: you haven't actually measured the bottlenecks in your system with a performance-analysis tool like CodeAnalyst or VTune.
Anyway, there's probably an infinite number of ways of doing what you ask, once you just decide that no single structure will do what you want, and you want some combination of the strengths and weaknesses of different algorithms.
One possibility is to divide the socket domain N into some number of buckets of size B, and then hash each socket into one of those (N/B) buckets. In each bucket is a heap (or whatever) with O(log B) update time. If an upper bound on N isn't fixed in advance but can vary, then you can create more buckets dynamically, which adds a little complication, but is certainly doable.
In the worst case, the watchdog timer has to search (N/B) queues for expirations, but I assume the watchdog timer is not required to kill idle sockets in any particular order!
That is, if 10 sockets went idle in the last time slice, it doesn't have to search that domain for the one that time-out first, deal with it, then find the one that timed-out second, etc. It just has to scan the (N/B) set of buckets and enumerate all time-outs.
If you're not satisfied with a linear array of buckets, you can use a priority queue of queues, but you want to avoid updating that queue on every message, or else you're back where you started. Instead, define some time that's less than the actual time-out (say, 3/4 or 7/8 of it), and only put a low-level queue into the high-level queue if its longest time exceeds that.
And at the risk of stating the obvious, you don't want your queues keyed on elapsed time. The keys should be start time. For each record in the queues, elapsed time would have to be updated constantly, but the start time of each record doesn't change.
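A sketch of the bucketed version, keyed on last-activity (start) time as suggested above (the class and method names are mine; finding the old entry on each touch is linear in the bucket size here, which a handle-based heap or lazy deletion would improve):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.PriorityQueue;

    class BucketedIdleTracker {
        static final class Conn {
            final int socketId;
            final long lastActivityMillis;   // start of the idle period, not elapsed time
            Conn(int socketId, long lastActivityMillis) {
                this.socketId = socketId;
                this.lastActivityMillis = lastActivityMillis;
            }
        }

        private final PriorityQueue<Conn>[] buckets;

        @SuppressWarnings("unchecked")
        BucketedIdleTracker(int bucketCount) {
            buckets = new PriorityQueue[bucketCount];
            for (int i = 0; i < bucketCount; i++) {
                buckets[i] = new PriorityQueue<>(
                    (a, b) -> Long.compare(a.lastActivityMillis, b.lastActivityMillis));
            }
        }

        private PriorityQueue<Conn> bucketFor(int socketId) {
            return buckets[Math.floorMod(socketId, buckets.length)];
        }

        // Per message: record the new activity time for this socket.
        void touch(int socketId, long nowMillis) {
            PriorityQueue<Conn> b = bucketFor(socketId);
            b.removeIf(c -> c.socketId == socketId);   // linear in bucket size; see note above
            b.add(new Conn(socketId, nowMillis));
        }

        // Watchdog: scan all buckets and collect every socket idle past the timeout,
        // in no particular order, as the answer allows.
        List<Integer> expire(long nowMillis, long idleTimeoutMillis) {
            List<Integer> idle = new ArrayList<>();
            for (PriorityQueue<Conn> b : buckets) {
                while (!b.isEmpty() && nowMillis - b.peek().lastActivityMillis >= idleTimeoutMillis) {
                    idle.add(b.poll().socketId);
                }
            }
            return idle;
        }
    }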
There's a VERY simple way to do all inserts and removes in O(1), taking advantage of the fact that 1) priority is based on time and 2) you probably have a small, fixed number of timeout durations.
Create a regular FIFO queue to hold all tasks that timeout in 10 seconds. Because all tasks have identical timeout durations, you can simply insert to the end and remove from the beginning to keep the queue sorted.
Create another FIFO queue for tasks with 30-second timeout duration. Create more queues for other timeout durations.
To cancel, remove the item from the queue. This is O(1) if the queue is implemented as a linked list.
Rescheduling can be done as cancel-insert, as both operations are O(1). Note that tasks can be rescheduled to different queues.
Finally, to combine all the FIFO queues into a single overall priority queue, have the head of every FIFO queue participate in a regular heap. The head of this heap will be the task with the soonest expiring timeout out of ALL tasks.
If you have m number of different timeout durations, the complexity for each operation of the overall structure is O(log m). Insertion is O(log m) due to the need to look up which queue to insert to. Remove-min is O(log m) for restoring the heap. Cancelling is O(1) but worst case O(log m) if you're cancelling the head of a queue. Because m is a small, fixed number, O(log m) is essentially O(1). It does not scale with the number of tasks.
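A sketch of that structure (the names are mine; cancellation here is lazy mark-and-skip rather than unlinking from a linked list, and the heap of queue heads is rebuilt on each expiry pass for simplicity, which is cheap because m is small):

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.PriorityQueue;

    class FifoPerDurationTimer {
        static final class Entry {
            final long deadlineMillis;
            final Runnable onTimeout;
            boolean cancelled;
            Entry(long deadlineMillis, Runnable onTimeout) {
                this.deadlineMillis = deadlineMillis;
                this.onTimeout = onTimeout;
            }
        }

        // One FIFO per distinct timeout duration; each stays sorted by construction.
        private final Map<Long, Deque<Entry>> queues = new HashMap<>();

        // O(1): append to the FIFO for this duration.
        Entry schedule(long timeoutMillis, Runnable onTimeout) {
            Entry e = new Entry(System.currentTimeMillis() + timeoutMillis, onTimeout);
            queues.computeIfAbsent(timeoutMillis, k -> new ArrayDeque<>()).addLast(e);
            return e;
        }

        // O(1): mark and skip lazily instead of unlinking.
        void cancel(Entry e) { e.cancelled = true; }

        // Fire everything that has expired. The overall earliest deadline is the
        // minimum over the queue heads, so only the m heads need to be compared.
        void expire() {
            long now = System.currentTimeMillis();
            PriorityQueue<Deque<Entry>> heads = new PriorityQueue<>(
                (a, b) -> Long.compare(a.peekFirst().deadlineMillis, b.peekFirst().deadlineMillis));
            for (Deque<Entry> q : queues.values()) {
                if (!q.isEmpty()) heads.add(q);
            }
            while (!heads.isEmpty()) {
                Deque<Entry> q = heads.poll();
                if (q.peekFirst().deadlineMillis > now) break; // earliest head; all others are later
                Entry e = q.pollFirst();
                if (!e.cancelled) e.onTimeout.run();
                if (!q.isEmpty()) heads.add(q);                // re-insert with its new head
            }
        }
    }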
Your specific scenario suggests a circular buffer to me. If the max. timeout is 30 seconds and we want to reap sockets at least every tenth of a second, then use a buffer of 300 doubly-linked lists, one for each tenth of a second in that period. To 'increaseTime' on an entry, remove it from the list it's in and add it to the one for its new tenth-second period (both constant-time operations). When a period ends, reap anything left over in the current list (maybe by feeding it to a reaper thread) and advance the current-list pointer.
You've got a hard limit on the number of items in the queue: there is a limit on the number of TCP sockets.
Therefore the problem is bounded. I suspect any clever data structure will be slower than using built-in types.
Is there a good reason not to use java.lang.PriorityQueue? Doesn't remove() handle your cancel operations in log(N) time? Then implement your own waiting based on the time until the item on the front of the queue.
I think storing all the tasks in a list and iterating through them would be best.
You must be planning to run the server on some pretty beefy machine to reach the limits where this cost will be important?