Recommend algorithm of fair distributed resources allocation consensus - algorithm

There are distributed computation nodes and there are set of computation tasks represented by rows in a database table (a row per task):
A node has no information about other nodes: can't talk other nodes and doesn't even know how many other nodes there are
Nodes can be added and removed, nodes may die and be restarted
A node connected only to a database
There is no limit of tasks per node
Tasks pool is not finite, new tasks always arrive
A node takes a task by marking that row with some timestamp, so that other nodes don't consider it until some timeout is passed after that timestamp (in case of node death and task not done)
The goal is to evenly distribute tasks among nodes. To achieve that I need to define some common algorithm of tasks acquisition: when a node starts, how many tasks to take?
If a node takes all available tasks, when one node is always busy and others idle. So it's not an option.
A good approach would be for each node to take tasks 1 by 1 with some delay. So
each node periodically (once in some time) checks if there are free tasks and takes only 1 task. In this way, shortly after start all nodes acquire all tasks that are more or less equally distributed. However the drawback is that because of the delay, it would take some time to take last task into processing (say there are 10000 tasks, 10 nodes, delay is 1 second: it would take 10000 tasks * 1 second / 10 nodes = 1000 seconds from start until all tasks are taken). Also the distribution is non-deterministic and skew is possible.
Question: what kind/class of algorithms solve such problem, allowing quickly and evenly distribute tasks using some sync point (database in this case), without electing a leader?
For example: nodes use some table to announce what tasks they want to take, then after some coordination steps they achieve consensus and start processing, etc.

So this comes down to a few factors to consider.
How many tasks are currently available overall?
How many tasks are currently accepted overall?
How many tasks has the node accepted in the last X minutes.
How many tasks has the node completed in the last X minutes.
Can the row fields be modified? (A field added).
Can a node request more tasks after it has finished it's current tasks or must all tasks be immediately distributed?
My inclination is do the following:
If practical, add a "node identifier" field (UUID) to the table with the rows. A node when ran generates a UUID node identifier. When it accepts a task it adds a timestamp and it's UUID. This easily let's other nodes be able to determine how many "active" nodes there are.
To determine allocation, the node determines how many tasks are available/accepted. it then notes how many many unique node identifiers (including itself) have accepted tasks. It then uses this formula to accept more tasks (ideally at random to minimize the chance of competition with other nodes). 2 * available_tasks / active_nodes - nodes_accepted_tasks. So if there are 100 available tasks, 10 active nodes, and this node has accepted 5 task already. Then it would accept: 100 / 10 - 5 = 5 tasks. If nodes only look for more tasks when they no longer have any tasks then you can just use available_tasks / active_nodes.
To avoid issues, there should be a max number of tasks that a node will accept at once.
If node identifier is impractical. I would just say that each node should aim to take ceil(sqrt(N)) random tasks, where N is the number of available tasks. If there are 100 tasks. The first node will take 10, the second will take 10, the 3rd will take 9, the 4th will take 9, the 5th will take 8, and so on. This won't evenly distribute all the tasks at once, but it will ensure the nodes get a roughly even number of tasks. The slight staggering of # of tasks means that the nodes will not all finish their tasks at the same time (which admittedly may or may not be desirable). By not fully distributing them (unless there are sqrt(N) nodes), it also reduces the likelihood of conflicts (especially if tasks are randomly selected) are reduced. It also reduces the number of "failed" tasks if a node goes down.
This of course assumes that a node can request more tasks after it has started, if not, it makes it much more tricky.
As for an additional table, you could actually use that to keep track of the current status of the nodes. Each node records how many tasks it has, it's unique UUID and when it last completed a task. Though that may have potential issues with database churn. I think it's probably good enough to just record which node has accepted the task along with when it accepted it. This is again more useful if nodes can request tasks in the future.

Related

Assign same queue item to multiple workers

We have a logical queue of tasks, where each task has to be assigned to multiple workers.
The numbers of workers to be assigned are based on a configuration of Minimum and Maximum workers.
A worker should not see the same task that they have already completed. It is not necessary that all workers will see all the tasks.
Total number of workers can change dynamically. Each worker can become online or offline anytime.
Each worker can choose to either complete the task or let it expire.
On expiry the task should be assigned to any worker who has not already completed the task.
Is there a good algorithm to solve this scenario ?
The easy solution:
Greedily assign tasks:
For every task that is ready.
Find the Minimum <= N <= Maximum workers that
didn't see the task yet and assign them.
Repeat until you either run
out of workers or finish all the tasks.
If a worker comes online or finishes task, re-check for all the tasks.
If a new task comes, re-check for the available workers.
This solution might be enough if there are not that many tasks, as it is computation heavy and recomputes everything.
The possible optimizations:
If the greedy solution fails(and it probably will), there are ways to improve on it. I will try to list those that come to my mind but it won't be an exhaustive list.
First, my personal favorite: Network flows. Unfortunately I don't see an easy way to resolve the minimal number of workers requirement, however it would be fast and it would result in a maximum possible workers being assigned at any given moment.
Create the network Source - Workers - Tasks - Sink. Edges workers to tasks will be linked and unlinked as needed:
when a worker is available for a task, create the edge with weight 1, otherwise don't create the edge.
From source link an edge with weight one to each online worker.
From every task link an edge with weight equal to its maximal worker capacity.
You could even differentiate between different kinds of workers, network flows are awesome. The algorithms are fast which makes them suitable even for large graphs. Also they are available in many libraries so you won't have to implement them yourself. Unfortunately, there is no easy way to enforce the minimal workers rule. At least I don't see one right now, there might be some way. Or maybe at least a heuristic
Second, be smart while being greedy:
Create a queue for every task.
When a worker is available, register him for every task he can do into its queue.
When a worker is not available, remove him from the queues.
When a task has enough workers, start progress and disable these workers.
This is still the brute force approach however since you keep the queues, you limit the amount of the necessary computation to a reasonable level. The potential downside is that large tasks(with a large minimal number of workers) might be stalled by the small tasks that will be easier to start - and will eat up the workers. So there will probably be some further checking/balancing and prioritizing needed.
There is definitely more to be tried & done for your task at hand, however the information you provided is rather limited so this advice is not that specific.

Max-Min and Min-Min algorithms implementation

I'm trying to simulate the Max-Min and Min-Min scheduling algorithms and code them myself in a simulation. But don't really understand how to implement the way they work in code.
For example, in FCFS algorithm i use 3 servers(vms),each server is faster than the other and 5 tasks with different arrival times. So the first task will check the first server and will be scheduled there, the second if arrive while the first is't completed yet, will check the availability and scheduled to the second server. If all 3 servers are occupied the next task will be scheduled to the one with the min remain executing time.
Now for the Min-Min and Min-Max this is the theoretical background:
Min-Min:
Phase 1: First computes the completion time of every task on each machine and then for every task select the machine which processes the tasks in minimum possible time.
Phase 2: Among all the tasks in Meta task the task with minimum completion time is selected and is assigned to machine on which minimum execution time is expected. The task is removed from the list of Meta Task and the procedure continues until Meta Task list is empty.
Max-Min:
Phase 1: First computes the completion time of every task on each machine and then for every task chooses the machine which processes the tasks in minimum possible time
Phase 2: Among all the tasks in Meta Task the task with maximum completion time is selected and is assigned to machine. The task is removed from the list of Meta Task and the procedure continues until Meta Task list is empty.
I get the phase 1 for both algorithms, I need to check the task's burst time and server's speedup -> burst/speedup = execution time. I will find the best server for each task.
But I can't understand the phase 2. For Min-Min i have to choose every time the fastest task and when i find it I will have to schedule it to the faster server. But the workload will be imbalanced, as I said 3 servers and at least one is the faster, lets say server with ID 1, so every time the tasks will scheduled to this one, I also need the other 2 to work.
Same problem with Max-Min, find the worst task, schedule it to the worst server, but only one server is the worst so the other 2 will not work. How am I suppose to do the balancing and also take in consideration that the tasks arrive in different times?
If you need anything more just let me know and thanks in advance!
You can find nice description of both algorithms in A Comparative Analysis of Min-Min and Max-Min Algorithms based on the Makespan Parameter:
I paste here pseudocode for Min-Min. ETij is execution time for task ti on resource Rj. And rj is ready time of Rj.
That's true that you can have imbalanced load, because all small task will get executed first. Max-Min algorithm overcomes this drawback.
Max-Min algorithm performs the same steps as the Min-Min algorithm but the main difference comes in the second phase, where a task ti is selected which has the maximum completion time instead of minimum completion time as in min-min and assigned to resource Rj, which gives the minimum completion time.

Job scheduling algorithm for cluster

I'm searching for algorithm suitable for problem below:
There are multiple computers(exact number is unknown). Each computer pulls job from some central queue, completes job, then pulls next one. Jobs are produced by some group of users. Some users submit lots of jobs, some a little. Jobs consume equal CPU time(not really, just approximation).
Central queue should be fair when scheduling jobs. Also, users who submitted lots of jobs should have some minimal share of resources.
I'm searching a good algorithm for this scheduling.
Considered two candidates:
Hadoop-like fair scheduler. The problem here is: where can I take minimal shares here when my cluster size is unknown?
Associate some penalty with each user. Increment penalty when user's job is scheduled. Use probability of scheduling job to user as 1 - (normalized penalty). This is something like stride scheduling, but I could not find any good explanation on it.
when I implemented a very similar job runner (for a production system), I ended having each server up choose jobtypes at random. This was my reasoning --
a glut of jobs from one user should not impact the chance of other users having their jobs run (user-user fairness)
a glut of one jobtype should not impact the chance of other jobtypes being run (user-job and job-job fairness)
if there is only one jobtype from one user waiting to run, all servers should be running those jobs (no wasted capacity)
the system should run the jobs "fairly", i.e. proportionate to the number of waiting users and jobtypes and not the total waiting jobs (a large volume of one jobtype should not cause scheduling to favor it) (jobtype fairness)
the number of servers can vary, and is not known beforehand
the waiting jobs, jobtypes and users metadata is known to the scheduler, but not the job data (ie, the usernames, jobnames and counts, but not the payloads)
I also wanted each server to be standalone, to schedule its own work autonomously without having to know about the other servers
The solution I settled on was to track the waiting jobs by their {user,jobtype} attribute tuple, and have each scheduling step randomly select 5 tuples and from each tuple up to 10 jobs to run next. The selected jobs were shortlisted to be run by the next available runner. Whenever capacity freed up to run more jobs (either because jobs finished or because of secondary restrictions they could not run), ran another scheduling step to fetch more work.
Jobs were locked atomically as part of being fetched; the locks prevented them from being fetched again or participating in further scheduling decisions. If they failed to run they were unlocked, effectively returning them to the pool. The locks timed out, so the server running them was responsible for keeping the locks refreshed (if a server crashed, the others would time out its locks and would pick up and run the jobs it started but didn't complete)
For my use case I wanted users A and B with jobs A.1, A.2, A.3 and B.1 to each get 25% of the resources (even though that means user A was getting 75% to user B's 25%). Choosing randomly between the four tuples probabilistically converges to that 25%.
If you want users A and B to each have a 50-50 split of resources, and have A's A.1, A.2 and A.3 get an equal share to B's B.1, you can run a two-level scheduler, and randomly choose users and from those users choose jobs. That will distribute the resources among users equally, and within each user's jobs equally among the jobtypes.
A huge number of jobs of a particular jobtype will take a long time to all complete, but that's always going to be the case. By picking from across users then jobtypes the responsiveness of the job processing will not be adversely impacted.
There are lots of secondary restrictions that can be added (e.g., no more than 5 calls per second to linkedin), but the above is the heart of the system.
You could try Torque resource management and Maui batch job scheduling software from Adaptive Computing. Maui policies are flexible enough to fit your needs. It supports backfill, configurable job and user priorities and resource reservations.

Extract properties of a hadoop job

Given a large datafile and jarfile containing mapper, reducer classes , I want to be able to know , how big Hadoop cluster should be formed ( I mean how many machines I would need to form a cluster for the given job to run efficiently.)
I am running the job on the given datafile(s).
Assuming your MapReduce job scales linearly, I suggest the following test to get a general idea of what you'll need. I assume you have a time in mind when you say "run efficiently"... this might be 1 minute for someone or 1 hour for someone... it's up to you.
Run the job on one node on a subset of your data that fits on one node... or a small number of nodes more preferably. This test cluster should be representative of the type of hardware you will purchase later.
[(time job took on your test cluster) x (number of nodes in test cluster)]
x [(size of full data set) / (size of sample data set)]
/ (new time, i.e., "run efficiently")
= (number of nodes in final cluster)
Some things to note:
If you double the "time job took on test cluster", you'll need twice as many nodes.
If you halve the "new time", i.e., you want your job to run twice as fast, you'll need twice as many nodes.
The ratio of the sample tells you how much to scale the result
An example:
I have a job that takes 30 minutes on a two nodes. I am running this job over 4GB of a 400GB data set (400/4 GB). I would like it if my job took 12 minutes.
(30 minutes x 2 nodes) x (400 / 4) GB / 12 = 500 nodes
This is imperfect in a number of ways:
With one or two nodes, I'm not fully taking into account how long it'll take to transfer stuff over the network... a major part of the mapreduce job. So, you can assume it'll take longer than this estimate. If you can, test your job over 4-10 nodes and scale it from there.
Hadoop doesn't "scale down" well. There is a certain speed limit that you won't be able to cross with MapReduce. Somewhere around 2-3 minutes on most clusters I've seen. That is, you won't be making a MapReduce job run in 3 seconds by having a million nodes.
Your job might not scale linearly, in which case this exercise is flawed.
Maybe you can't find representative hardware. In which case, you'll have to factor in how much faster you think your new system will be.
In summary, there is no super accurate way of doing what you say. The best you can really do right now is experimentation and extrapolation. The more nodes you can do a test on, the better, as the extrapolation part will be more accurate.
In my experience, when testing from something like 200 nodes to 800 nodes, this metric is pretty accurate. I'd be nervous about going from 1 node or 2 nodes to 800. But 20 nodes to 800 might be OK.

What's the best Task scheduling algorithm for some given tasks?

We have a list of tasks with different length, a number of cpu cores and a Context Switch time.
We want to find the best scheduling of tasks among the cores to maximize processor utilization.
How could we find this?
Isn't it like if we choose the biggest available tasks from the list and give them one by one to the current ready cores, it's going to be best or you think we must try all orders to find out which is the best?
I must add that all cores are ready at the time unit 0 and the tasks are supposed to work concurrently.
The idea here is that there's no silver bullet, for what you must consider what are the types of tasks being executed, and try to schedule them as nicely as possible.
CPU-bound tasks don't use much communication (I/O), and thus, need to be continuously executed, and interrupted only when necessary -- according to the policy being used;
I/O-bound tasks may be continuously put aside in the execution, allowing other processes to work, since it will be sleeping for many periods, waiting for data to be retrieved to primary memory;
interative tasks must be continuously executed, but needs not to be executed without interruptions, as it will generate interruptions, waiting for user inputs, but it needs to have a high priority, in order not to let the user notice delays in the execution.
Considering this, and the context switch costs, you must evaluate what types of tasks you have, choosing, thus, one or more policies for your scheduler.
Edit:
I thought this was a simply conceptual question. Considering you have to implement a solution, you must analyze the requirements.
Since you have the length of the tasks, and the context switch times, and you have to maintain the cores busy, this becomes an optimization problem, where you must keep the minimal number of cores idle when it reaches the end of the processes, but you need to maintain the minimum number of context switches, so that your overall execution time does not grow too much.
As pointed by svick, this sounds like a partition problem, which is NP-complete, and in which you need to divide a sequence of numbers into a given number of lists, so that the sum of each list is equal to each other.
In your problem you'd have a relaxation on the objective, so that you no longer need all the cores to execute the same amount of time, but you want the difference between any two cores execution time to be as small as possible.
In the reference given by svick, you can see a dynamic programming approach that you may be able to map onto your problem.

Resources