Graph algorithm for load distribution

I came across the following problem of distributing load over a number of machines in a network. It is an interview question: the candidate should specify which algorithm and which data structures are best for this problem.
We have N machines in a network. Each machine can accept up to 5 units of load. The requested algorithm receives as input a list of machines with their current load (ranging from 0 to 5), the distance matrix between the machines, and the new load M that we want to distribute on the network.
The algorithm returns the list of machines that can service the M units of load and have the minimum collective distance. The collective distance is the sum of the distances between the machines in the resulting list.
For example, if the resulting list contains three machines A, B, and C, these machines can collectively service the M units of load (if M = 5, A can service 3, B can service 1, C can service 1), and the sum of distances SUM = AB + BC is the smallest path that can collectively service the M units of load.
Do you have any proposals on how to approach it?

The simplest approach I can think of is defining a value for every machine, something like the summation of inverted distances between this machine and all its adjacent machines:
v_i = sum(1/dist(i, j) for j in A_i), where A_i is the set of machines adjacent to machine i.
You can invert the summation again and call it the machine's crowd value (or something like that), but you don't need to.
Then sort the machines by this value so that the most crowded machine comes first (descending order, or ascending if you inverted the summation).
Start with the most crowded machine and assign as much load to it as you can. Then go to the next machine and do the same, until you have assigned all of the load. A sketch of this heuristic follows.
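A minimal sketch of this heuristic in Python, assuming a full distance matrix with strictly positive off-diagonal entries, a per-machine capacity of 5, and illustrative names:

def assign_greedy(dist, load, M, capacity=5):
    """dist: distance matrix, load[i]: current load of machine i,
    M: new load to distribute. Returns {machine: extra load assigned}."""
    n = len(dist)
    # crowd value: sum of inverted distances to all other machines
    v = [sum(1.0 / dist[i][j] for j in range(n) if j != i) for i in range(n)]
    assignment = {}
    for i in sorted(range(n), key=lambda k: v[k], reverse=True):
        if M == 0:
            break
        take = min(capacity - load[i], M)  # free capacity of machine i
        if take > 0:
            assignment[i] = take
            M -= take
    return assignment

Like any greedy heuristic, this optimizes for crowding rather than for the collective distance itself, so it can miss the optimal set.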

It sounds like every machine is able to process the same amount of load -- namely 5 units. And the cost measure you state depends only on the set of machines that have nonzero load (i.e. adding more load to a machine that already has nonzero load will not increase the cost). Therefore the problem can be decomposed:
1. Find the smallest number k <= n of machines that can perform all jobs. For this step we can ignore the individual identities of the machines and how they are connected.
2. Once you know the minimum number k of machines necessary, decide which k of the n machines offers the lowest cost.
Step (1) is a straightforward bin packing problem. Although bin packing is NP-hard, excellent heuristics exist and nearly all instances can be quickly solved to optimality in practice.
There may be linear algebra methods for solving (2) more quickly (if anyone knows of one, feel free to edit or suggest in the comments), but without thinking too hard about it you can always just use branch and bound. This potentially takes time exponential in n, but should be OK if n is low enough, or if you can get a decent heuristic solution that bounds out most of the search space.
(I did try thinking of a DP in which we calculate f(i, j), the lowest cost way of choosing i machines from among machines 1, ..., j, but this runs into the problem that when we try adding the jth machine to f(i - 1, j - 1), the total cost of the edges from the new machine to all existing machines depends on exactly which machines are in the solution for f(i - 1, j - 1), and not just on the cost of this solution, thus violating optimal substructure.)
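A sketch of the decomposition, under two assumptions: the M units can be split freely across machines (as the example in the original question does, splitting M = 5 as 3/1/1), and "collective distance" means the sum over all pairs in the chosen set, as the branch-and-bound discussion above implies. Plain exhaustive search over k-subsets stands in for branch and bound:

from itertools import combinations

def min_machines(free, M):
    # step 1: fewest machines whose free capacity covers M units
    k, total = 0, 0
    for f in sorted(free, reverse=True):
        if total >= M:
            break
        total += f
        k += 1
    return k

def best_k_subset(dist, free, M):
    # free[i] = 5 - current load of machine i
    k = min_machines(free, M)
    candidates = [i for i in range(len(free)) if free[i] > 0]

    def pair_cost(S):
        return sum(dist[a][b] for a, b in combinations(S, 2))

    feasible = (S for S in combinations(candidates, k)
                if sum(free[i] for i in S) >= M)
    return min(feasible, key=pair_cost)  # step 2: cheapest feasible k-subset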

Related

Minimizing the maximum cost of a task-to-machine assignment

I have m machines and n tasks. There is an m by n cost matrix A where Aij is the cost of executing task j on machine i. Each task must be assigned to exactly one machine, but each machine may accept multiple tasks.
My problem is to find the way to assign the tasks to machines to minimize MakeSpan, the maximum cost of any one machine.
How might I solve this problem? I considered using the Hungarian Algorithm, but it minimizes the total cost, rather than the maximum cost of any one machine.
You can express the problem as an integer linear program. Let B[i][j] be a matrix of 0's and 1's with the meaning that B[i][j] = 1 if we assign the jth task to the ith machine. The fact that the tasks aren't splittable makes this an ILP rather than an LP -- otherwise we could just insist that 0 <= B[i][j] <= 1.
We want to minimize the maximum cost of any machine. That's not a linear function, but there's a standard trick to express it as such, by introducing a dummy variable (which in the program here is called MakeSpan).
The ILP is the following program, which has m+n constraints:
minimize MakeSpan such that
sum(B[i][j] for i=1..m) = 1 for all j
sum(B[i][j]*A[i][j] for j=1..n) <= MakeSpan for all i
The first set of constraints formalizes the idea that every task is assigned to exactly one machine. The second set of constraints is that the cost of every machine is at most MakeSpan.
At an optimum, MakeSpan equals the maximum cost of any machine (the constraint is tight for the busiest machine), and minimizing MakeSpan therefore minimizes that maximum cost.
To solve the ILP, you can use your favorite ILP solver. For example, GLPK is an open source solver.
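For concreteness, here is a sketch of this ILP in Python using the open-source PuLP modeling library (the cost matrix A is made-up example data):

import pulp

A = [[4, 2, 8], [3, 7, 5]]        # A[i][j]: cost of task j on machine i
m, n = len(A), len(A[0])

prob = pulp.LpProblem("min_makespan", pulp.LpMinimize)
B = [[pulp.LpVariable(f"B_{i}_{j}", cat="Binary") for j in range(n)]
     for i in range(m)]
makespan = pulp.LpVariable("MakeSpan", lowBound=0)

prob += makespan                   # objective: minimize MakeSpan
for j in range(n):                 # each task goes to exactly one machine
    prob += pulp.lpSum(B[i][j] for i in range(m)) == 1
for i in range(m):                 # each machine's cost is at most MakeSpan
    prob += pulp.lpSum(A[i][j] * B[i][j] for j in range(n)) <= makespan

prob.solve()
print(pulp.value(makespan))        # optimal maximum machine cost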

Resource allocation algorithm

I know an algorithm for this exists, but I am having problems naming it and finding a suitable solution.
My problem is as follows:
I have a set of J jobs that need to be completed.
All jobs take different times to complete, but the time is known.
I have a set of R resources.
Each resource R may have any number of instances from 1 to 100.
A Job may need to use any number of resources R.
A job may need to use multiple instances of a resource R, but never more instances than the resource R has. (If a resource only has 2 instances, a job will never need more than 2.)
Once a job completes it returns all instances of all resources it used back into the pool for other jobs to use.
A job cannot be preempted once started.
As long as resources allow, there is no limit to the number of jobs that can simultaneously execute.
This is not a directed graph problem; the jobs J may execute in any order as long as they can claim their resources.
My Goal:
The optimal way to schedule the jobs to minimize run time and/or maximize resource utilization.
I'm not sure how good this idea is, but you could model this as an integer linear program, as follows (not tested).
Define some constants,
Use[j,i] = amount of resource i used by job j
Time[j] = length of job j
Capacity[i] = amount of resource i available
Define some variables,
x[j,t] = 1 if job j starts at time t (binary)
r[i,t] = amount of resource of type i used at time t
slot[t] = 1 if time slot t is used (binary)
The constraints are,
// every job must start exactly once
(1). for every j, sum[t](x[j,t]) = 1
// a resource can only be used up to its capacity
(2). r[i,t] <= Capacity[i]
// if a job is running, it uses resources
(3). r[i,t] = sum[j, s | s <= t < s + Time[j]] (x[j,s] * Use[j,i])
// if a job is running, then the time slot is used
(4). slot[t] >= x[j,s] for all j, s with s <= t < s + Time[j]
The third constraint means that if a job was started recently enough that it's still running, then its resource usage is added to the currently used resources. The fourth constraint means that if a job was started recently enough that it's still running, then this time slot is used.
The objective function is the weighted sum of slots, with higher weights for later slots, so that it prefers to fill the early slots. In theory the weights must increase exponentially to ensure using a later time slot is always worse than any configuration that uses only earlier time slots, but solvers don't like that and in practice you can probably get away with using slower growing weights.
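A sketch of this model in PuLP, with r[i,t] substituted directly into the capacity constraint (2); the job data, horizon H, and geometric slot weights are illustrative assumptions:

import pulp

Use = {(0, "cpu"): 2, (1, "cpu"): 3}   # Use[j,i]: amount of resource i used by job j
Time = {0: 3, 1: 2}                    # Time[j]: length of job j
Capacity = {"cpu": 4}                  # Capacity[i]: amount of resource i available
H = 8                                  # number of time slots in the horizon
jobs, resources = list(Time), list(Capacity)

prob = pulp.LpProblem("schedule", pulp.LpMinimize)
x = pulp.LpVariable.dicts("x", (jobs, list(range(H))), cat="Binary")
slot = pulp.LpVariable.dicts("slot", list(range(H)), cat="Binary")

# objective: weighted sum of slots, later slots cost exponentially more
prob += pulp.lpSum((2 ** t) * slot[t] for t in range(H))

for j in jobs:
    # (1) every job starts exactly once, and early enough to finish
    prob += pulp.lpSum(x[j][t] for t in range(H)) == 1
    for s in range(H - Time[j] + 1, H):
        prob += x[j][s] == 0

for i in resources:
    for t in range(H):
        # (2)+(3): resource usage at time t stays within capacity
        prob += pulp.lpSum(Use[j, i] * x[j][s] for j in jobs
                           for s in range(H) if s <= t < s + Time[j]) <= Capacity[i]

for j in jobs:
    for s in range(H):
        for t in range(s, min(s + Time[j], H)):
            prob += slot[t] >= x[j][s]   # (4): a running job marks its slots used

prob.solve()
print({j: next(t for t in range(H) if pulp.value(x[j][t]) > 0.5) for j in jobs})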
You will need enough slots that a solution exists, but preferably not too many more than you end up needing, so I suggest you start with a greedy solution to give you a hopefully non-trivial upper bound on the number of time slots (there is also the trivial bound given by the sum of the lengths of all tasks).
There are many ways to get a greedy solution, for example just scheduling the jobs one by one in the earliest time slot where they fit. It may work better to order them by some measure of "hardness" and put the hard ones in first; for example you could give them a score based on how badly they use a resource up (say, the sum of Use[j,i] / Capacity[i], or maybe the maximum? who knows, try some things) and then order by that score in decreasing order, as in the sketch after the next paragraph.
As a bonus, you may not always have to solve the full ILP problem (which is NP-hard, so sometimes it can take a while), if you solve just the linear relaxation (allowing the variables to take fractional values, not just 0 or 1) you get a lower bound, and the approximate greedy solutions give upper bounds. If they are sufficiently close, you can skip the costly integer phase and take a greedy solution. In some cases this can even prove the greedy solution optimal, if the rounded-up objective from the linear relaxation is the same as the objective of the greedy solution.
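A sketch of that greedy upper bound, using the earliest-feasible-start rule with the resource-ratio hardness score suggested above (the job data is made up):

def greedy_schedule(jobs, capacity):
    """jobs: list of (duration, {resource: amount}); capacity: {resource: amount}.
    Returns {job index: start time}, hardest jobs placed first."""
    def hardness(job):
        duration, use = job
        return sum(amount / capacity[r] for r, amount in use.items())

    order = sorted(range(len(jobs)), key=lambda j: hardness(jobs[j]), reverse=True)
    usage = {}                          # usage[t][r]: amount of r in use in slot t
    start = {}
    for j in order:
        duration, use = jobs[j]
        t = 0
        while not all(usage.get(s, {}).get(r, 0) + amount <= capacity[r]
                      for s in range(t, t + duration)
                      for r, amount in use.items()):
            t += 1                      # slide right until the job fits
        start[j] = t
        for s in range(t, t + duration):
            for r, amount in use.items():
                usage.setdefault(s, {})[r] = usage[s].get(r, 0) + amount
    return start

print(greedy_schedule([(3, {"cpu": 2}), (2, {"cpu": 3})], {"cpu": 4}))  # {1: 0, 0: 2}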
This might be a job for Dijkstra's algorithm. For your case, if you want to maximize resource utilization, then each node in the search space is the result of adding a job to the list of jobs you'll do at once. The edges are then the resources which are left when you add a job to that list.
The goal, then, is to find the path to the node whose incoming edge has the smallest value.
An alternative, which is more straightforward, is to view this as a knapsack problem.
To construct this problem as an instance of The Knapsack Problem, I'd do the following:
Assuming I have a set J of jobs j_1, j_2, ..., j_n and resources R, I want to find the subset of J such that, when that subset is scheduled, the leftover resources R are minimized (I'll call that subset J').
in pseudo-code:
def knapsack(J, R, J_prime):
    potential_solutions = []
    for j in J:
        if R > resources_used_by(j):
            potential_solutions.append(
                knapsack(J - {j}, R - resources_used_by(j), J_prime | {j}))
        else:
            return J_prime, R
    return best_solution_of(potential_solutions)

Is there an exact algorithm for the minimum makespan scheduling with 2 identical machines and N processes that exists for small constraints?

If 2 identical machines are given, with N jobs where the i-th job takes T[i] time to complete, is there an exact algorithm to assign these N jobs to the 2 machines so that the makespan, the total time required to complete all N jobs, is minimum?
I need to solve the problem only for N=50.
Also note that total execution time of all the processes is bounded by 10000.
Does greedily allocating the largest remaining job to the machine that becomes free first work?
// s1 -> machine 1, s2 -> machine 2, a[i] -> job i, time -> time consumed;
// jobs sorted in descending order are allocated one by one to the machine
// which is free.
#include <algorithm>
#include <cassert>
using namespace std;

long long greedyMakespan(int a[], int n) {
    sort(a, a + n);
    reverse(a, a + n);                 // jobs in descending order
    long long s1 = a[0];               // remaining work on machine 1
    long long s2 = a[1];               // remaining work on machine 2
    long long time = min(s1, s2);      // run both until one becomes free
    s1 -= time;
    s2 -= time;
    int i = 2;
    while (i < n) {
        if (s1 == 0 && s2 == 0) {
            // both machines idle: start the next two jobs together
            s1 = a[i];
            if (i + 1 < n) s2 = a[i + 1];
            long long c = min(s1, s2);
            time += c;
            s1 -= c;
            s2 -= c;
            i += 2;
        } else {
            // exactly one machine is idle: give it the next job
            if (s1 < s2) swap(s1, s2); // make s2 the idle machine
            s2 = a[i];
            long long c = min(s1, s2);
            time += c;
            s1 -= c;
            s2 -= c;
            i++;
        }
    }
    assert(s1 * s2 == 0);              // at least one machine has finished
    return time + max(s1, s2);
}
The problem you described is NP-hard via a more or less straightforward reduction from Subset Sum, which makes an exact polynomial time algorithm impossible unless P=NP. Greedy assignment will not yield an optimal solution in general. However, as the number of jobs is bounded by 50, any exact algorithm with running time exponential in N is in fact an algorithm with constant running time.
The problem can be tackled via dynamic programming as follows. Let P be the sum of all processing times, which is an upper bound for the optimal makespan. Define an array S[N][P] as the state space, where S[i][j] is the minimum makespan attainable for jobs indexed by 1,...,i such that the load of machine 1 is exactly j. An outer loop iterates over the jobs, an inner loop over the target load of machine 1. In each iteration, we have to decide whether job i should run on machine 1 or machine 2; of course, the state values have to be computed in such a way that only assignments which actually exist are taken into account.
In the first case (job i runs on machine 1), state S[i][j] is attainable only if S[i-1][j - T[i]] is attainable; in the second case (job i runs on machine 2), it is attainable only if S[i-1][j] is attainable. Whenever S[i][j] is attainable, the load of machine 1 is j and the load of machine 2 is its complement, the sum of T[i'] for i' in {1,...,i} minus j; the state value is the maximum of these two loads, minimized over the attainable cases.
Finally, the optimal makespan can be found by taking the minimum value of S[N][j] over all j. Note that the approach only calculates the optimal value, not an optimal solution itself; an optimal solution can be recovered by backtracking or using suitable auxiliary data structures. The running time and space requirement would be O(N*P), i.e. pseudopolynomial.
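A compact sketch of this idea in Python, collapsing the two cases into a subset-sum reachability table (a 1D rolling version of S): for a reachable machine-1 load j, the makespan is max(j, P - j).

def min_makespan(T):
    P = sum(T)                          # total processing time
    reachable = [True] + [False] * P    # reachable[j]: machine-1 load j attainable
    for t in T:                         # each job goes to machine 1 or machine 2
        for j in range(P, t - 1, -1):   # iterate downward so each job is used once
            reachable[j] = reachable[j] or reachable[j - t]
    return min(max(j, P - j) for j in range(P + 1) if reachable[j])

print(min_makespan([3, 1, 4, 2, 2]))    # 6: e.g. {4, 2} vs {3, 2, 1}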
Note that the problem and the approach are very similar to the Knapsack problem. For the scheduling problem, however, the choice is not whether or not to include an item, but whether to execute a job on machine 1 or on machine 2.
Also note that the problem is actually well-studied; in the so-called three-field notation it is P2||Cmax. If I recall correctly, greedy list scheduling yields an approximation ratio of 2 - 1/m, as proved in the following article (scheduling jobs in non-increasing order of processing time does even better).
R. L. Graham, "Bounds for certain multiprocessing anomalies," Bell System Technical Journal 45 (1966), 1563-1581.

Distance Calculation for massive number of devices/nodes Part 2

This question is an enhancement to the previous SO question.
Distance Calculation for massive number of devices/nodes
I have N mobile devices/nodes (say 100K) and I periodically obtain their location (latitude, longitude) values.
Some of the devices are "logically connected" to roughly M other devices (say 10 on average). My program periodically compares the distance between each device and its logically connected devices and determines if the distance is within a threshold (say 100 meters).
Furthermore, the number of logical connection sets K can also be more than one (say 5 on average).
For example, A can be connected to B and C under a "parents" logic; A can also be connected to C, D, E, F under a "work" logic.
I need a robust algorithm to calculate these distances to the logically connected devices.
The complexity order of the brute-force approach would be N*M*K (roughly cubic, Θ(N^3), in terms of order).
The program does this every 3 seconds (all devices are mobile), so for instance 100K * 10 * 5 = 5M calculations every 3 seconds, which is not good.
Any good/classical algorithms for this operation ?
I decided to rewrite my answer after a bit more thought.
The complexity of your problem is not O(N^3); in the worst case it is actually only O(N^2). It's also not O(N*M*K) but rather O(N*(M+K)), where O(M+K) is O(N). However, the real complexity of your problem is O(E), where E is the total number of logical connections (number of work connections + number of parent connections). Unless you want to approximate, your solution cannot be better than O(E). Your averages suggest that you likely have on the order of 5 million connections, which is on the order of O(N log N).
Your example uses two sets of logical connections. So you would simply cycle through each set and check if the distance between the devices of each logical connection is within the threshold.
That being said, the example you gave and your assumed time complexity suggest you are interested in more than just whether the individual connections are within the threshold, but rather whether whole sets of connections are within it. Specifically, in your example it would return True if the parents-logic connections (A,B), (A,C) and the work-logic connections (A,C), (A,D), (A,E), (A,F) are all True. In that case your best data structure would be a dictionary of dictionaries that looks like the following in Python (and includes the optimization below):
"parentsLogic[A][B] = (last position A, last position B, was within threshold)".
If it's common that the positions don't change much, you may obtain some run-time improvement by storing the previous positions and if they were within the threshold or not (Boolean). The benefit is that you can simply return the previous result if the two positions haven't changed and updating them if they have changed.
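A sketch of that cache in Python, assuming positions are (latitude, longitude) tuples and using the standard haversine formula for great-circle distance in meters:

from math import asin, cos, radians, sin, sqrt

def haversine(p, q):
    lat1, lon1, lat2, lon2 = map(radians, (*p, *q))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 6371000 * 2 * asin(sqrt(a))   # Earth radius ~6371 km, result in meters

def within_threshold(cache, a, b, pos_a, pos_b, threshold=100.0):
    # cache[a][b] = (last position a, last position b, was within threshold)
    entry = cache.get(a, {}).get(b)
    if entry is not None and entry[0] == pos_a and entry[1] == pos_b:
        return entry[2]                  # neither endpoint moved: reuse result
    result = haversine(pos_a, pos_b) <= threshold
    cache.setdefault(a, {})[b] = (pos_a, pos_b, result)
    return result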
You can use a brute-force algorithm, sort the results, and then use the top (best) groups.
One thing you can do in addition to what was suggested in the answers to the previous question is to store a list of the nearby connected devices for every device and update it only for those devices that have moved by a significant distance since last update (and for the devices connected to those that have moved).
For example, if the threshold is 100 m, store a list of the connected devices within 200 m of every device, and update it for every device that has moved by more than 50 m since the last update.
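A sketch of that update rule, reusing the haversine() helper from the previous sketch (the data layout is an assumption):

def refresh(device, positions, last_positions, connections, nearby, threshold=100.0):
    """Rebuild the nearby list for a device only if it moved more than
    threshold/2 since its list was last built."""
    if haversine(positions[device], last_positions[device]) <= threshold / 2:
        return                           # barely moved: keep the cached list
    last_positions[device] = positions[device]
    nearby[device] = [d for d in connections[device]
                      if haversine(positions[device], positions[d]) <= 2 * threshold]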

A graph algorithm

There is an algorithm question which I really can't figure out. It may involve Dijkstra's algorithm.
There is a network of n computers that you will hack to take control. Initially, you have already hacked computer c_0. There are m connections between computers, through which you can take down an uncontrolled computer from a hacked one. Each connection is described as a triple (c_a, c_b, t), which means that if c_a is hacked, then you can successfully hack c_b at a cost of t minutes.
A large group of your hacker friends join you in hacking (they are as good as you and as many as the computers in the network). They are all at your command, which means you can assign them hacking tasks on multiple computers simultaneously. Describe an efficient algorithm to determine how many minutes you would need to successfully hack all the computers in the network. State the running time in terms of n and m.
Label the computers c_0, c_1, ..., c_{n-1}. After running Dijkstra's algorithm from c_0, the answer you are looking for is max{ d[i] | 0 <= i <= n-1 }, where d[i] denotes the minimum distance between c_0 and c_i. This is true because:
1) You need at least as much time as the maximum of those distances in order to hack the most distant computer.
2) Consider the shortest-path tree we get after applying Dijkstra's algorithm (c_0 is the root of that tree). We can apply the following strategy: first hack all the neighbors of c_0, then keep hacking every computer adjacent to an already hacked one, until all the computers have been hacked. Since the hacks run in parallel, the time needed equals the maximum weighted depth of the tree (the edges of this tree have the same weights as in the original graph), which is exactly the number mentioned before.
So the total running time equals Dijkstra's: O(m + n log n).
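A minimal sketch of this answer in Python: run Dijkstra from c_0 and take the largest shortest-path distance. The graph is assumed to be an adjacency list {node: [(neighbor, minutes), ...]} built from the (c_a, c_b, t) triples.

import heapq

def time_to_hack_all(graph, c0):
    dist = {c0: 0}
    heap = [(0, c0)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                     # stale heap entry, already improved
        for v, t in graph.get(u, []):
            if d + t < dist.get(v, float("inf")):
                dist[v] = d + t
                heapq.heappush(heap, (d + t, v))
    return max(dist.values())            # time until the last computer falls

example = {0: [(1, 3), (2, 1)], 2: [(1, 1)], 1: []}
print(time_to_hack_all(example, 0))      # 2: c_2 falls at t=1, c_1 at t=2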
