Given a set of possible starting nodes, find the smallest path that visits certain nodes and returns back - algorithm

I have a set of nodes (<= 10,000) and edges (<= 50,000) that connect them into a single connected component. That is, you can reach any node from any other node using at least one sequence of edges. Each edge has a defined length.
I am supplied a set of must-pass nodes (at most 5). All of them have to be visited, and we may pass through them multiple times if needed. We need to start our journey from one of the nodes that is not a must-pass node, visit all must-pass nodes, and return to our initial node.
We need to find the shortest distance of such a path.
Say we have 5 nodes indexed 1, 2, 3, 4, 5 and the following edges in the format node -> node : length (all undirected):
1 -> 2 : 1
1 -> 5 : 2
3 -> 2 : 3
3 -> 4 : 5
4 -> 2 : 7
4 -> 5 : 10
And the must-pass nodes are 1, 2, 3. The shortest distance is achieved by starting from node 5 and taking the path 5-1-2-3-2-1-5, for a total distance of 12. So 12 is the desired output.
Is there an efficient way to approach this?

Here's an O(E log V) solution:
Let's treat the chosen starting node as a must-pass node as well.
Using Dijkstra, find the shortest paths between all pairs of "must-pass" nodes.
Now the problem is reduced to a Traveling Salesman Problem with 6 cities.
We can either check all permutations in O(6!) time or use dynamic programming in O(2^6 * 6^2); either way, since 6 is a constant, the complexity of this part is O(1).
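A minimal C++ sketch of that idea, assuming an adjacency-list graph adj, the must-pass nodes in must, and one candidate start node (to handle a whole set of possible starting nodes, you would call bestTour for each candidate, ideally hoisting the Dijkstra runs out of that loop):

#include <bits/stdc++.h>
using namespace std;

using pli = pair<long long, int>;
const long long INF = LLONG_MAX / 4;

// Standard Dijkstra from src over a weighted adjacency list, O(E log V).
vector<long long> dijkstra(int src, const vector<vector<pair<int, long long>>>& adj) {
    vector<long long> dist(adj.size(), INF);
    priority_queue<pli, vector<pli>, greater<pli>> pq;
    dist[src] = 0;
    pq.push({0, src});
    while (!pq.empty()) {
        auto [d, u] = pq.top(); pq.pop();
        if (d != dist[u]) continue;
        for (auto [v, w] : adj[u])
            if (d + w < dist[v]) { dist[v] = d + w; pq.push({dist[v], v}); }
    }
    return dist;
}

// Shortest closed walk that starts and ends at `start` and visits every node
// in `must` (it may pass through any node, including must-pass ones, freely).
long long bestTour(int start, const vector<int>& must,
                   const vector<vector<pair<int, long long>>>& adj) {
    int k = must.size();
    vector<vector<long long>> d(k);                 // d[i] = distances from must[i]
    for (int i = 0; i < k; ++i) d[i] = dijkstra(must[i], adj);

    // Held-Karp DP: dp[mask][i] = shortest walk from `start` that has visited
    // exactly the must-pass nodes in `mask` and currently ends at must[i].
    vector<vector<long long>> dp(1 << k, vector<long long>(k, INF));
    for (int i = 0; i < k; ++i) dp[1 << i][i] = d[i][start];
    for (int mask = 1; mask < (1 << k); ++mask)
        for (int i = 0; i < k; ++i) {
            if (!((mask >> i) & 1) || dp[mask][i] >= INF) continue;
            for (int j = 0; j < k; ++j)
                if (!((mask >> j) & 1))
                    dp[mask | (1 << j)][j] = min(dp[mask | (1 << j)][j],
                                                 dp[mask][i] + d[i][must[j]]);
        }
    long long best = INF;
    for (int i = 0; i < k; ++i)                     // close the tour back to `start`
        best = min(best, dp[(1 << k) - 1][i] + d[i][start]);
    return best;
}

For the example above, calling bestTour for every node that is not a must-pass node and taking the minimum gives 12, achieved with start node 5.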

Related

Calculate the lowest cost of running several resource collectors on an undirected acyclic graph

We have an undirected acyclic graph (so there is only a single path between any two connected nodes), which may look something like this.
Simple guide to the image:
Black numbers: Node id's (note that each node contains a resource collector)
Resource collectors: Every resource collector collects resources on the edges directly adjacent to its node (so a collector on node 0 cannot reach resource deposits past node 1, and so on). A resource collector also requires fuel to operate. The amount of fuel a collector needs is determined by its range: the range is the distance to the furthest resource deposit it reaches on its allowed edges (the range of a collector is the blue circle on some nodes in the image). The fuel consumption of a collector is then: fuel = radius of its circle. In the example, node 0 consumes 1 fuel and nodes 1 and 3 consume 2 fuel each; since all the resource deposits are covered, the total fuel requirement is 5 (nodes 2, 5 and 4 do not use any fuel, since their radii are 0).
Black lines: Graph edges
Red dots: Resource deposits. We only receive the number of deposits on each edge, and the deposits are evenly spaced along their respective edge.
Our task is to find the best configuration for our resource collectors, that is, set each collector's radius so that all resource deposits are reached and the total fuel consumption is as low as possible.
Things I've tried to solve this problem:
At first I tried locating the "central" node of the graph and then traversing it with BFS, looking one node ahead and determining the amount of fuel from there. This worked for some graphs, but became unstable on more complex ones.
After that I tried basically the same thing, but starting from the leaf nodes; this produced similarly imperfect results.
This is an allocation problem.
Set "cost" of each pair of collector and deposit it can reach to be the distance from collector to deposit. Other pairs, where the deposit is unreachable, have infinite cost and can be omitted from the input.
Set collector/deposit pair with ( next ) lowest cost. Allocate deposit to collector. Repeat until all deposits allocated.
Set radius of each collector to be the distance from the collector to the furthest deposit assigned to it.
Here is the cost of each pair in your example. 031 means the first deposit on the edge from 0 to 3.
Note that you renumbered the nodes in your example! The numbers in the table refer to the original diagram, which looked like this:
collector   deposit   cost
    0         031       1
    0         032       2
    0         033       3
    3         031       3
    3         032       2
    3         033       1
    3         351       1
    3         361       1
    3         362       2
    3         371       1
    3         372       2
    5         351       1
    5         581       1
    5         582       2
    6         361       2
    6         362       1
    7         371       2
    7         372       1
    8         581       2
    8         582       1
Note that this algorithm will assign deposit 231 to collector 2. This is different from your answer, but has the same total fuel cost. However 142 will go to 4 and 152 will go to 5, which increases the total fuel cost.
To handle this situation, collectors with more than 2 connections to other nodes need to be looked at, to see if the fuel cost can be further reduced by increasing their radius, "robbing" deposits from their neighbors.
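A rough C++ sketch of the allocation loop described above, assuming the reachable (collector, deposit) pairs are already given as (cost, collector, deposit) triples and unreachable pairs are simply left out (the names here are illustrative, not from the original post):

#include <bits/stdc++.h>
using namespace std;

struct CandidatePair { double cost; int collector, deposit; };

// Greedy allocation: repeatedly take the cheapest unassigned deposit/collector
// pair, assign the deposit, and grow that collector's radius if needed.
vector<double> allocate(int numCollectors, int numDeposits, vector<CandidatePair> pairs) {
    sort(pairs.begin(), pairs.end(),
         [](const CandidatePair& a, const CandidatePair& b) { return a.cost < b.cost; });
    vector<char> assigned(numDeposits, 0);
    vector<double> radius(numCollectors, 0.0);
    for (const CandidatePair& p : pairs) {
        if (assigned[p.deposit]) continue;                       // each deposit is allocated once
        assigned[p.deposit] = 1;
        radius[p.collector] = max(radius[p.collector], p.cost);  // reach the furthest assigned deposit
    }
    return radius;                                               // total fuel = sum of the radii
}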
This is not an allocation problem. There is a greedy optimal solution in O(n).
Here is the essence of the solution:
Let R(V,W) be the # of resource deposits on the edge between V & W. Note: R(V,W) == R(W,V), since the graph is undirected.
Set weight(V) = 0 for all vertices in the graph
1. A connected undirected acyclic graph is a tree.
2. Either the graph has 1 node (terminal case), or there is at least 1 vertex with degree 1. If the graph has 1 node, the algorithm is finished.
3. Select any vertex with degree 1. Let V be the vertex. Let W represent the other vertex connected to V.
4. To cover all deposits on edge(V,W) we need weight(V) + weight(W) >= R(V,W), the number of resource deposits on that edge.
5. Since V has degree 1, it is never worse to supply any missing coverage from node W, whose collector can also help on its other edges.
6. Set weight(W) = max(weight(W), R(V,W) - weight(V)).
7. Remove V and E(V,W) from the graph. The resulting graph is still a connected undirected acyclic graph. Repeat until the graph is a single vertex.
Test case #1:
A - B - C - D - E. All edges have 1 resource deposit.
Select A. weight(B) = 1, since R(A,B) - weight(A) = 1.
Select B. weight(C) = 0, since R(B,C) - weight(B) = 0.
Select C. weight(D) = 1, since R(C,D) - weight(C) = 1.
Select D. weight(E) = 0, since R(D,E) - weight(D) = 0.
Test case #2:
A - B - C - D - E. R(A,B) = R(D,E) = 1. R(B,C) = R(C,D) = 2.
Select A. weight(B) = 1, since R(A,B) - weight(A) = 1.
Select E. weight(D) = 1, since R(D,E) - weight(E) = 1.
Select B. weight(C) = 1, since R(B,C) - weight(B) = 1.
Select C. weight(D) = 1, since max(weight(D), R(C,D) - weight(C)) = 1.
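A C++ sketch of this leaf-peeling loop, assuming the tree is given as adjacency lists where each entry is {neighbour, R} and R is the number of deposits on that edge (the function and variable names are mine, not from the original post):

#include <bits/stdc++.h>
using namespace std;

// Returns weight(V) for every vertex; the total fuel is the sum of the weights.
vector<long long> collectorWeights(int n, const vector<vector<pair<int, long long>>>& adj) {
    vector<long long> weight(n, 0);
    vector<int> degree(n);
    for (int v = 0; v < n; ++v) degree[v] = adj[v].size();
    vector<char> removed(n, 0);

    queue<int> leaves;                               // step 3: any vertex of degree 1
    for (int v = 0; v < n; ++v) if (degree[v] == 1) leaves.push(v);
    while (!leaves.empty()) {
        int v = leaves.front(); leaves.pop();
        removed[v] = 1;
        for (auto [w, r] : adj[v]) {
            if (removed[w]) continue;                // w is V's single remaining neighbour
            weight[w] = max(weight[w], r - weight[v]);   // step 6
            if (--degree[w] == 1) leaves.push(w);        // step 7: peel and repeat
        }
    }
    return weight;
}

On test case #1 this produces weights 0, 1, 0, 1, 0 (total fuel 2), matching the trace above.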

undirected acyclic graph edge insertion

Is there an algorithm that inserts an edge into an undirected graph only if the graph remains acyclic?
For example:
if the graph looks like this
0 - 1
|
2 - 3
4 - 5
valid insertion : 2-4
0 - 1
|
2 - 3
|
4 - 5
invalid insertion : 1 - 3
0 - 1
| | <=== cyclic!!!
2 - 3
4 - 5
If there is any example code in C++, I would really appreciate it.
You can maintain a disjoint-set data structure for the set of vertices. This structure has the following operations:
find(x) to return the identifier of the set where x belongs,
union(x,y) to merge the sets of x and y.
Start with a singleton set for each vertex.
Before adding an edge, check whether its ends are in the same set.
If not, add the edge and merge the corresponding sets.
For your example, the state of the data structure is the following:
S1 = {0,1,2,3}
S2 = {4,5}
When you try to add the edge 1-3, you find that vertices 1 and 3 belong to the same set S1, so you skip adding it.
When you try to add the edge 2-4, you find that vertices 2 and 4 belong to different sets (S1 and S2, respectively), so you add this edge and update the structure to be:
S1 = {0,1,2,3,4,5}
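Since the question asks for C++, here is a minimal union-find sketch of the above (the DSU struct and the variable names are just illustrative):

#include <bits/stdc++.h>
using namespace std;

// Disjoint-set union (union-find) with path compression.
struct DSU {
    vector<int> parent;
    DSU(int n) : parent(n) { iota(parent.begin(), parent.end(), 0); }
    int find(int x) { return parent[x] == x ? x : parent[x] = find(parent[x]); }
    bool unite(int x, int y) {                   // returns false if x and y are already connected
        x = find(x); y = find(y);
        if (x == y) return false;
        parent[y] = x;
        return true;
    }
};

int main() {
    DSU dsu(6);                                  // vertices 0..5 from the example
    for (auto [a, b] : vector<pair<int,int>>{{0, 1}, {0, 2}, {2, 3}, {4, 5}})
        dsu.unite(a, b);                         // existing edges

    // Adding 1-3 would create a cycle (1 and 3 are already in the same set);
    // adding 2-4 keeps the graph acyclic, so we add it and merge the sets.
    cout << boolalpha;
    cout << "insert 1-3 allowed: " << (dsu.find(1) != dsu.find(3)) << "\n";  // false
    cout << "insert 2-4 allowed: " << dsu.unite(2, 4) << "\n";               // true
}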

Algorithm - Finding the most rewarding path in a given graph

Question: You are given the following inputs:
3
0 0 1
3 1 1
6 0 9
The first line is the number of points on the graph.
The rest of the lines contain the points on the graph, and their reward. For example:
0 0 1 would mean at point (0,0) [which is the starting point] you are given a reward of 1.
3 1 1 would mean at point (3,1) you are given a reward of 1.
6 0 9 would mean at point (6, 0) you are given a reward of 9.
Going from point a to point b costs 1 per unit of distance (the Euclidean distance between the two points).
Therefore if you go from (0,0) -> (3,1) -> (6,0), your reward is 11 - 2 * sqrt(10): you collect 11 in rewards and make 2 moves of length sqrt(10) each.
Goal: Determine the maximum amount of rewards you can make (the total amount of reward you collect - the cost) based on the provided inputs.
How would I go about solving this? It seems like dynamic programming is the way to go, but I am not sure where to start.
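To make the scoring rule concrete, here is a small C++ sketch (the names are mine) that simply evaluates the net reward of a given visiting order under the rule above: sum the rewards, then subtract the Euclidean length of every move.

#include <bits/stdc++.h>
using namespace std;

struct Point { double x, y, reward; };

// Net reward of visiting the points in the given order.
double score(const vector<Point>& order) {
    double total = 0, cost = 0;
    for (size_t i = 0; i < order.size(); ++i) {
        total += order[i].reward;
        if (i > 0)
            cost += hypot(order[i].x - order[i - 1].x, order[i].y - order[i - 1].y);
    }
    return total - cost;
}

int main() {
    // Sample input: (0,0,1) -> (3,1,1) -> (6,0,9)
    vector<Point> path = {{0, 0, 1}, {3, 1, 1}, {6, 0, 9}};
    cout << score(path) << "\n";   // 11 - 2*sqrt(10), roughly 4.675
}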

Maximum path cost in matrix

Can anyone tell me an algorithm for finding the maximum path cost in an NxM matrix, starting from the top-left corner and ending at the bottom-right corner, where left, right and down movements are allowed and the matrix may contain negative costs? A cell can be visited any number of times, and after visiting a cell its cost is replaced with 0.
Constraints
1 <= nxm <= 4x10^6
INPUT
4 5
1 2 3 -1 -2
-5 -8 -1 2 -150
1 2 3 -250 100
1 1 1 1 20
OUTPUT
37
An explanation of the output is given in the accompanying image.
Since you also have negative costs, use Bellman-Ford. Change the sign of all the costs (convert negative to positive and positive to negative), then find the shortest path; this path will be the longest one in the original graph because you have changed the signs.
If no cost ever becomes negative, use Dijkstra's shortest-path algorithm, but first negate all values; this will give you the longest path and its cost.
Your matrix is a directed graph. In your image you are trying to find a path (max or min) from index (0,0) to (n-1,m-1).
You need these things to represent it as a graph.
You need a linked list where each node stores a first_Node, a second_Node, and the cost to move from the first node to the second.
An array of linked lists, where each array index holds one linked list. If, for example, there is an edge from 0 to 5 and from 0 to 1 (in an undirected graph), then your graph will look like this.
If you want a directed graph, simply add 5 to adj[0] and do not add 0 to adj[5]; this means there is a path from 0 to 5 but not from 5 to 0.
Here the linked list represents only which nodes are connected, not their costs. You have to add an extra variable that keeps the cost for each pair of nodes, and it will look like this.
Now put this linked list in your array instead of the first one, and you have a graph on which you can run a shortest- or longest-path algorithm.
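One way to set this up in C++, as a sketch under my own assumptions: each cell becomes a vertex, and an edge to a left/right/down neighbour carries the value of the cell you move into (the "revisited cells count as 0" rule would still need separate handling).

#include <bits/stdc++.h>
using namespace std;

// Turn an N x M matrix into a directed weighted graph: cell (r,c) is vertex r*M + c.
vector<vector<pair<int,int>>> buildGraph(const vector<vector<int>>& grid) {
    int n = grid.size(), m = grid[0].size();
    vector<vector<pair<int,int>>> adj(n * m);          // adj[u] = list of {v, weight}
    const int dr[] = {0, 0, 1}, dc[] = {-1, 1, 0};     // left, right, down
    for (int r = 0; r < n; ++r)
        for (int c = 0; c < m; ++c)
            for (int k = 0; k < 3; ++k) {
                int nr = r + dr[k], nc = c + dc[k];
                if (nr >= 0 && nr < n && nc >= 0 && nc < m)
                    adj[r * m + c].push_back({nr * m + nc, grid[nr][nc]});
            }
    return adj;
}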
If you want a more intelligent algorithm, you can use A* with a heuristic; I guess Manhattan distance will work best.
If the cost of your edges is never negative, use Dijkstra.
If costs can be negative, use the Bellman-Ford algorithm.
You can always find the longest path by flipping the signs (minus to plus and plus to minus) and then running a shortest-path algorithm; the path found will be the longest one.
I answered this question, and as you said in the comments, look at point two. If that's the task, then the main idea of this assignment is to ensure monotonicity.
h stands for heuristic cost.
A stands for accumulated cost.
The condition says that for every pair of adjacent nodes A and B, h(A) <= A(A,B) + h(B). This means that if you want to move from A to B, the estimated total cost should not decrease but increase along the path (can you transform your values so that this property holds?). Once you satisfy this condition, every node that A* chooses will be part of your path from the source to the goal, because that is the path with the shortest/longest value.
pathMax: you can enforce monotonicity. If there is a path from A to B such that f(S...AB) < f(S...A), then set f(S...AB) = max(f(S...AB), f(S...A)), where S is the source.
Since moving up is not allowed, paths always look like a set of horizontal intervals, one per row, where consecutive intervals share at least 1 column (for the down move). Answers can be characterized as, say
struct Answer {
    int layer[N][2]; // layer[i][0] and layer[i][1] are the interval's start and end,
                     // with 0 <= layer[i][0] <= layer[i][1] < M,
                     // layer[0][0] == 0, layer[N-1][1] == M-1,
                     // and layers i and i+1 have a non-empty intersection
};
An alternative encoding is to note only layer widths and offsets to each other; but you would still have to make sure that the last layer includes the exit cell.
Assuming that you have a maxLayer routine that finds the highest-scoring interval in each layer (O(M) per layer), and that all such intervals overlap, this would yield an optimal answer in O(N*M). However, it may be necessary to expand intervals to ensure that overlap occurs, and there may be multiple highest-scoring intervals in a given layer. At this point I would model the problem as a directed graph:
each layer has one node per score-maximizing horizontal continuous interval.
nodes from one layer are connected to nodes in the next layer according to the cost of expanding both intervals to achieve at least 1 overlap. If they already overlap, the cost is 0. Edge costs will always be zero or negative (otherwise, either source or target intervals could have scored higher by growing bigger). Add the (expanded) source-node interval value to the connection cost to get an "edge weight".
You can then run Dijkstra on this graph (negate edge weights so that the "longest path" is returned) to find the optimal path. Even better, since all paths pass once and only once through each layer, you only need to keep track of the best route to each node, and only need to build nodes and edges for the layer you are working on.
Implementation details ahead
to calculate maxLayer in O(M), use Kadane's Algorithm, modified to return all maximal intervals instead of only the first. Where the linked algorithm discards an interval and starts anew, you would instead keep a copy of that contender to use later.
given the sample input, the maximal intervals would look like this:
[0]
1 2 3 -1 -2 [1 2 3]
-5 -8 -1 2 -150 => [2]
1 2 3 -250 100 [1 2 3] [100]
1 1 1 1 20 [1 1 1 1 20]
[0]
given those intervals, they would yield the following graph:
(0)
| =>0
(+6)
\ -1=>5
\
(+2)
=>7/ \ -150=>-143
/ \
(+7) (+100)
=>12 \ / =>-43
\ /
(+24)
| =>37
(0)
When two edges are incident on a single node (row 1 1 1 1 20), carry forward only the highest incoming value.
For each element in a row, find the maximum cost that can be obtained if we move horizontally across the row, given that we go through that element.
Eg. For the row
1 2 3 -1 -2
The maximum cost for each element obtained if we move horizontally given that we pass through that element will be
6 6 6 5 3
Explanation:
For element 3: we can move backwards horizontally, touching 1 and 2. We will not move horizontally forward, as the values -1 and -2 reduce the cost value.
So the maximum cost for 3 = 1 + 2 + 3 = 6
The maximum cost matrix for each of elements in a row if we move horizontally, for the input you have given in the description will be
6 6 6 5 3
-5 -7 1 2 -148
6 6 6 -144 100
24 24 24 24 24
Since we can move vertically from one row to the below row, update the maximum cost for each element as follows:
cost[i][j] = cost[i][j] + cost[i-1][j]
So the final cost matrix will be :
6 6 6 5 3
1 -1 7 7 -145
7 5 13 -137 -45
31 29 37 -113 -21
The maximum value in the last row of the above matrix gives you the required output, i.e. 37.
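A C++ sketch of this method as described (the function name is mine): for every cell, compute the best horizontal interval sum through it using prefix/suffix running maxima, add the previous row's accumulated value, and finally take the maximum of the last row.

#include <bits/stdc++.h>
using namespace std;

long long maxPathCost(const vector<vector<long long>>& a) {
    int n = a.size(), m = a[0].size();
    vector<long long> prev(m, 0);                      // accumulated costs of the row above
    for (int i = 0; i < n; ++i) {
        vector<long long> left(m), right(m), best(m);
        left[0] = a[i][0];                             // best interval ending at j (must include j)
        for (int j = 1; j < m; ++j) left[j] = max(a[i][j], left[j - 1] + a[i][j]);
        right[m - 1] = a[i][m - 1];                    // best interval starting at j (must include j)
        for (int j = m - 2; j >= 0; --j) right[j] = max(a[i][j], right[j + 1] + a[i][j]);
        for (int j = 0; j < m; ++j)
            best[j] = left[j] + right[j] - a[i][j] + prev[j];   // horizontal max, then move down
        prev = best;
    }
    return *max_element(prev.begin(), prev.end());
}

int main() {
    vector<vector<long long>> grid = {
        { 1,  2,  3,   -1,   -2},
        {-5, -8, -1,    2, -150},
        { 1,  2,  3, -250,  100},
        { 1,  1,  1,    1,   20}};
    cout << maxPathCost(grid) << "\n";   // prints 37 for the sample input
}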

1000 items, 1000 nodes, 3 items per node, best replication scheme to minimize data loss as nodes fail? [closed]

I was wondering what would be the right answer for Question 2-44 in Skiena's Algorithm Design Manual (2nd ed.)
The question is the following:
We have 1,000 data items to store on 1,000 nodes. Each node can store
copies of exactly three different items. Propose a replication scheme
to minimize data loss as nodes fail. What is the expected number of
data entries that get lost when three random nodes fail?
I was thinking about node n having data item from n, n+1 & n+2.
So if 3 consecutive nodes are lost then we lose 1 item.
Is there a better solution?
The approach you propose is not bad, but also take a look here. The ideas used in RAID may give you some ideas. For instance, if you have 2 data items, then with storage for 3 items you can recover either of them if the other fails. The idea is quite simple: you store the two items on two nodes and the XOR of their bits on the third. I believe that if you use this idea you will be able to have more than 3 backups of a single data item (i.e. more than 3 nodes have to fail in order to lose the information).
I thought of methods like RAID levels but Skiena says "each node can store copies of exactly three different items." Even though XOR'red bit patterns of two separate data can be stored in the same amount of space, I did not think that it was something the problem was looking for.
So I started with what the OP thought of: store the three copies of each data item on the node itself and its next two neighbors, in a striped fashion. For example, the following is for N == 6, where the data are the integers from 0 to 5 (items 4 and 5 wrap around and use nodes 0 and 1):
nodes: 0 1 2 3 4 5
===========
copy 0 -> 0 1 2 3 4 5
copy 1 -> 5 0 1 2 3 4
copy 2 -> 4 5 0 1 2 3
Of all 20 combinations of three-node failures, there are six that lose exactly one piece of data. For example, when nodes 1, 2, and 3 fail, data item 1 gets lost:
===========
0 X X X 4 5
5 X X X 3 4
4 X X X 2 3
The same holds for each other data item, making 6 of the 20 combinations lose data. Skiena does not describe what "data loss" means for the application: does the loss of a single data item mean that the entire collection is wasted, or is losing a single item acceptable and better than losing two?
If the loss of even a single data point means that the entire collection is wasted, then we can do better. Three times better! :)
Instead of distributing the copies of data to the right-hand nodes in a striped fashion, define groups of three nodes that share data. For example, let 0, 1, and 2 share their data and 3, 4, and 5 share their data:
nodes: 0 1 2 3 4 5
===========
copy 0 -> 0 1 2 3 4 5
copy 1 -> 2 0 1 5 3 4
copy 2 -> 1 2 0 4 5 3
This time, only 2 of the 20 combinations ever produce data loss. Data items 0, 1, and 2 are lost together when nodes 0, 1, and 2 fail:
===========
x x x 3 4 5
x x x 5 3 4
x x x 4 5 3
And data 3, 4, and 5 are lost together when nodes 3, 4, and 5 fail:
===========
0 1 2 x x x
2 0 1 x x x
1 2 0 x x x
That amounts to just 2 of the 20 combinations of three-node failures. When the same nodes share the same data, data losses are effectively merged into a smaller number of failure combinations.
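A small C++ sketch that brute-forces the two N == 6 layouts above and counts how many of the C(6,3) = 20 three-node failures lose at least one item (it prints 6 for the striped layout and 2 for the grouped one):

#include <bits/stdc++.h>
using namespace std;

// copies[r][node] = which item node stores as its r-th copy.
int losingTriples(const array<array<int,6>,3>& copies) {
    int losing = 0;
    for (int a = 0; a < 6; ++a)
        for (int b = a + 1; b < 6; ++b)
            for (int c = b + 1; c < 6; ++c)
                for (int d = 0; d < 6; ++d) {            // item d is lost iff every copy is on a failed node
                    bool alive = false;
                    for (int r = 0; r < 3; ++r)
                        for (int node = 0; node < 6; ++node)
                            if (copies[r][node] == d && node != a && node != b && node != c)
                                alive = true;
                    if (!alive) { ++losing; break; }
                }
    return losing;
}

int main() {
    array<array<int,6>,3> striped, grouped;
    // Striped layout above: copy r on node k holds item (k - r) mod 6.
    for (int r = 0; r < 3; ++r)
        for (int node = 0; node < 6; ++node)
            striped[r][node] = ((node - r) % 6 + 6) % 6;
    // Grouped layout above: nodes {0,1,2} share items {0,1,2}, nodes {3,4,5} share {3,4,5}.
    for (int r = 0; r < 3; ++r)
        for (int node = 0; node < 6; ++node)
            grouped[r][node] = (node / 3) * 3 + ((node % 3 + 3 - r) % 3);
    cout << losingTriples(striped) << "\n";   // 6
    cout << losingTriples(grouped) << "\n";   // 2
}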
Let,
D = {1,...,d_i,...,d} denote the data items and d_i a given data element
N = {1,...,n_k,...,n} denote the storage cluster and n_k a given storage node.
We say d_i is stored by n_k, loosely denoted by d_i \in n_k.
My replication model has the following assumptions:
1- Every data item must be stored on at least one node at initialization time. I.e.:
There exists at least one k with 1 <= k <= n such that P(d_i \in n_k) = 1.
2- From (1), at initialization time, the probability of d_i being on a given node is at least 1/n. I.e.:
For any data item 1 <= i <= d and a randomly chosen node n_k, P(d_i \in n_k) >= 1/n.
Given the problem statement, by design, we want to have this distribution uniform across the data set.
3- Lastly, by design, the probability of a data item d_i being on a given node should be independent across data items. I.e.:
P(d_i \in n_k | d_j \in n_k) = P(d_i \in n_k)
This is because we don't assume the probability of node failure is independent between adjacent nodes (e.g., in data centers adjacent nodes may share the same network switch, etc.).
From these assumptions, I proposed the following replication model (for the problem instance where d = n and each node stores exactly 3 distinct data items).
(1) Perform a random permutation of data set.
(2) Using a sliding window of length 3 and stride 1, rotate over the shuffled data set and map the data items to each node.
E.g.:
D = {A,B,C,D}
N = {1,2,3,4}
(1) {C, B, A, D}
(2) 1 -> {C, B, A}, 2 -> {B, A, D}, 3-> {A, D, C}, 4-> {D, C, B}
The random shuffling ensures independence (3) and a uniform distribution (2), while the sliding window of stride 1 guarantees (1).
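A tiny C++ sketch of steps (1) and (2), assuming d = n (the function and variable names are mine):

#include <bits/stdc++.h>
using namespace std;

// Shuffle the n data items, then give node k the window of 3 consecutive
// items starting at position k (wrapping around), i.e. a stride-1 sliding window.
vector<array<int, 3>> assignItems(int n, mt19937& rng) {
    vector<int> items(n);
    iota(items.begin(), items.end(), 0);
    shuffle(items.begin(), items.end(), rng);            // step (1): random permutation
    vector<array<int, 3>> node(n);
    for (int k = 0; k < n; ++k)                          // step (2): sliding window
        for (int r = 0; r < 3; ++r)
            node[k][r] = items[(k + r) % n];
    return node;
}

int main() {
    mt19937 rng(42);
    auto node = assignItems(1000, rng);
    // node[k][0] is the item for which node k is the master; node[k][1] and
    // node[k][2] replicate the items mastered by nodes (k+1) % n and (k+2) % n.
    cout << node[0][0] << " " << node[0][1] << " " << node[0][2] << "\n";
}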
Let's denote the sliding window of a given node n_k as the ordered set w_k = {w_k1, w_k2, w_k3}. n_k is said to be the master node for w_k1 (the first element of w_k). Any other node n_j containing w_k1 is a replica node. N.B.: the proposed replication model guarantees exactly one master node for any d_i, while the number of replica nodes depends on the window length.
In the example above, n_1 is the master node for C, and n_3 and n_4 are replica nodes.
Back to the original problem: given this scheme, a data item is lost only when its master node and all of its replica nodes fail.
P(d_i is lost) = P(master node for d_i fails and replica 1 fails and replica 2 fails).
Without a formal proof, an unbiased random permutation in step (1) above would result in
P(d_i is lost) = P(master node for d_i fails) * P(replica 1 fails) * P(replica 2 fails).
Again, the random permutation is a heuristic to abstract away the joint distribution of node failures.
From assumptions (2) and (3), P(d_i is lost) = c, for any d_i, at initialization time.
That said for d = n = 1000 and replication factor of 3 (i.e.: window length equals 3).
P(d_i is lost) = 1/1000 * 1/999 * 1/998 ~ 10^-9
Your approach seems essentially correct but can benefit from a failover strategy. Notice that Prof. Skiena has asked "to minimize data loss as nodes fail" which suggests that failing nodes will be a common occurrence.
You may want to have a look at consistent hashing.
Also, there is a great post by reddit engineers about the perils of not using consistent hashing (instead using a fixed MOD hashing).

Resources