Transitive closure in bidirected graph - algorithm

I have a big structure with items and relations between the items.
I need to find all transitive relations for all items. I duplicate all links and use transitive closure. E.g.:
A --- B --- C        E --- F --- G
      |
      |
      D
As a result I need to get the pairs:
A-B, A-C, A-D, B-A, B-C, B-D, C-A, C-B, C-D, D-A, D-B, D-C,
E-F, E-G, F-E, F-G, G-E, G-F
To use transitive closure I have to feed it the duplicated pairs [A-B, B-A, B-C, C-B, B-D, D-B, E-F, F-E, F-G, G-F].
This is a big problem for me because the dataset is very large.
The best way to solve my problem would be an algorithm that gets all relations using only one-sided links (A-B, B-C, B-D, E-F, F-G).
Are there any algorithms to get all relations for each element of the graph without duplicate links?

You can model this as a graph problem and traverse your entire dataset using either DFS (depth-first search) or BFS (breadth-first search). During the traversal, assign a component number to each tree in the forest of data you are investigating; as a result, you can find all the connected components of this graph of data. Then, for each connected component, you may simply form groups of 2 from its members and use those to describe the relation. If there is an odd number of elements, you can pick an already used item and link it to the last remaining one.
This obviously assumes that your goal is to find the connected components alone, and not to print the relations, as you put it, in a specific manner. For instance, if you were trying to print the links so that the maximum distance between the items is as small as possible, the problem becomes much more complex.
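A minimal sketch of this first (BFS) approach under that assumption (Python; items and the one-sided links are as in the question):

from collections import defaultdict, deque

def label_components(items, links):
    # Every stored one-sided link (a, b) is treated as bidirectional here,
    # so the duplicated reverse links never have to be materialised.
    adjacency = defaultdict(set)
    for a, b in links:
        adjacency[a].add(b)
        adjacency[b].add(a)
    component = {}
    next_id = 0
    for start in items:
        if start in component:
            continue
        next_id += 1                          # a new tree in the forest
        component[start] = next_id
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in component:
                    component[neighbour] = next_id
                    queue.append(neighbour)
    return component                          # equal numbers = same component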
Another approach, which shares the same assumption I mentioned above, is the method of union-find, also known as the disjoint-set data structure. You start with N singleton sets, one per item. Then, as you traverse the relations, for each relation (x, y) you unite the sets which contain the items x and y. In the end, all items of a connected component end up in the same set.
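A sketch of that union-find variant (Python, path compression only; add union by rank or size as well to get the inverse-Ackermann bound mentioned below):

def label_components_union_find(items, relations):
    # Start with one singleton set per item; every relation (x, y) unites
    # the sets containing x and y.
    parent = {x: x for x in items}

    def find(x):
        while parent[x] != x:                 # walk up to the representative,
            parent[x] = parent[parent[x]]     # compressing the path as we go
            x = parent[x]
        return x

    def union(x, y):
        root_x, root_y = find(x), find(y)
        if root_x != root_y:
            parent[root_x] = root_y

    for x, y in relations:
        union(x, y)
    return {x: find(x) for x in items}        # equal representatives = same component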
The first approach has O(V + E) time complexity, V and E being the number of items and relations in your data, respectively. The second approach has O(V + E * α(V)) time complexity, where α is the inverse Ackermann function, which grows extremely slowly (i.e. even more slowly than a logarithm).

Related

Select the most elements that do not overlap so that the sum of their size is maximized

I'm trying to find an algorithm to the following problem.
Say I have a number of objects A, B, C,...
I have a list of valid combinations of these objects. Each combination is of length 2 or 4.
E.g. AF, CE, CEGH, ADFG, ... and so on.
For combinations of two objects, e.g. AF, the length of the combination is 2. For combinations of four objects, e.g. CEGH, the length of the combination is 4.
I can only pick non-overlapping combinations, i.e. I cannot pick AF and ADFG because both require objects 'A' and 'F'. I can pick combinations AF and CEGH because they do not require common objects.
If my solution consists of only the two combinations AF and CEGH, then my objective is the sum of the length of the combinations, which is 2 + 4 = 6.
Given a list of objects and their valid combinations, how do I pick the most valid combinations that don't overlap with each other so that I maximize the sum of the lengths of the combinations? I do not want to formulate it as an IP as I am working with a problem instance with 180 objects and 10 million valid combinations and solving an IP using CPLEX is prohibitively slow. Looking for some other elegant way to solve it. Can I perhaps convert this to a network? And solve it using a max-flow algorithm? Or a Dynamic program? Stuck as to how to go about solving this problem.
My first attempt at showing this problem to be NP-hard was wrong, as it did not take into account the fact that only combinations of size 2 or 4 were allowed. However, using Jim D.'s suggestion to reduce from 3-dimensional matching (3DM), we can show that the problem is nevertheless NP-hard.
I'll show that the natural decision problem form of your problem ("Given a set O of objects, and a set C of combinations of either 2 or 4 objects from O, and an integer m, does there exist a subset D of C such that all sets in D are pairwise disjoint, and the union of all sets in D has size at least m?") is NP-hard. Clearly the optimisation problem (i.e., your original problem, where we seek an actual subset of combinations that maximises m above) is at least as hard as this problem. (To see that the optimisation problem is not "much" harder than the decision problem, notice that you could first find the maximum m value for which a solution exists using a binary search on m in which you solve a decision problem at each step, and then, once this maximal m value has been found, solving a series of decision problems in which each combination in turn is removed: if the solution after removing some particular combination is still "YES", then it may also be left out of all future problem instances, while if the solution becomes "NO", then it is necessary to keep this combination in the solution.)
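To make that argument concrete, here is a small sketch (Python) of the binary search plus pruning just described; decide(combinations, m) stands for a hypothetical black-box solver of the decision problem and is not something provided by any library.

def maximise_union(combinations, decide):
    # decide(combinations, m) is the hypothetical decision oracle: True iff
    # some pairwise disjoint subset of `combinations` covers at least m objects.
    lo, hi = 0, sum(len(c) for c in combinations)   # m = 0 is always feasible
    while lo < hi:                                  # binary search for the largest feasible m
        mid = (lo + hi + 1) // 2
        if decide(combinations, mid):
            lo = mid
        else:
            hi = mid - 1
    best_m = lo
    # Recover an actual selection: a combination may be dropped whenever the
    # remaining ones still admit a "YES" answer for best_m.
    kept = list(combinations)
    for c in list(kept):
        rest = [other for other in kept if other is not c]
        if decide(rest, best_m):
            kept = rest
    return best_m, kept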
Given an instance (X, Y, Z, T, k) of 3DM, where X, Y and Z are sets that are pairwise disjoint from each other, T is a subset of X*Y*Z (i.e., a set of ordered triples with first, second and third components from X, Y and Z, respectively) and k is an integer, our task is to determine whether there is any subset U of T such that |U| >= k and all triples in U are pairwise disjoint (i.e., to answer the question, "Are there at least k non-overlapping triples in T?"). To turn any such instance of 3DM into an instance of your problem, all we need to do is create a fresh 4-combination from each triple in T, by adding a distinct dummy value to each. The set of objects in the constructed instance of your problem will consist of the union of X, Y, Z, and the |T| dummy values we created. Finally, set m to 4k: every constructed combination has size exactly 4, so covering at least 4k objects is the same as picking at least k pairwise disjoint combinations.
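A small sketch of this construction (Python; the function name and the dummy encoding are illustrative only):

def reduce_3dm_to_combinations(triples, k):
    # Each 3DM triple (x, y, z) becomes a 4-combination with its own fresh
    # dummy object; since every combination has size exactly 4, the target
    # union size m is 4*k.
    combinations = []
    objects = set()
    for idx, (x, y, z) in enumerate(triples):
        dummy = ("dummy", idx)                      # distinct per triple
        combo = frozenset({x, y, z, dummy})
        combinations.append(combo)
        objects |= combo
    return objects, combinations, 4 * k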
Suppose that the answer to the original 3DM instance is "YES", i.e., there are at least k non-overlapping triples in T. Then each of the k triples in such a solution corresponds to a 4-combination in the input C to your problem, and no two of these 4-combinations overlap, since by construction their 4th elements are all distinct, and by assumption the triples themselves are pairwise disjoint. These k disjoint 4-combinations together cover 4k = m objects, so the solution for that problem must also be "YES".
In the other direction, suppose that the solution to the constructed instance of your problem is "YES", i.e., there are pairwise disjoint 4-combinations in C covering at least m = 4k objects. Since every combination has size 4, there are at least k of them, and we can simply take the first 3 elements of each (throwing away the dummy) to produce a set of at least k non-overlapping triples in T, so the answer to the original 3DM instance must also be "YES".
We have shown that a YES-answer to one problem implies a YES-answer to the other, thus a NO-answer to one problem implies a NO-answer to the other. Thus the problems are equivalent. The instance of your problem can clearly be constructed in polynomial time and space. It follows that your problem is NP-hard.
You can reduce this problem to the maximum weighted clique problem, which is, unfortunately, NP-hard.
Build a graph in which every combination is a vertex with weight equal to the length of the combination, and connect two vertices if the corresponding combinations do not share any object (i.e. if you can pick both of them at the same time). Then a selection of combinations is valid if and only if it forms a clique in that graph.
A simple Google search brings up a lot of approximation algorithms for this problem, such as this one.
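As a concrete illustration of that construction, here is a small sketch (Python). It is purely illustrative: building all pairwise edges is quadratic in the number of combinations, so it will not scale to 10 million of them.

def build_compatibility_graph(combinations):
    # Vertex i carries weight len(combinations[i]); an edge joins i and j iff
    # the two combinations share no object, so any valid selection of
    # combinations corresponds exactly to a clique in this graph.
    weights = [len(c) for c in combinations]
    adjacency = {i: set() for i in range(len(combinations))}
    for i in range(len(combinations)):
        for j in range(i + 1, len(combinations)):
            if not set(combinations[i]) & set(combinations[j]):
                adjacency[i].add(j)
                adjacency[j].add(i)
    return weights, adjacency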

Sorting sequences where the binary sorting function return is undefined for some pairs

I'm doing some comp. mathematics work where I'm trying to sort a sequence with a complex mathematical sorting predicate, which isn't always defined between two elements in the sequence. I'm trying to learn more about sorting algorithms that gracefully handle element-wise comparisons that cannot be made, as I've only managed a very rudimentary approach so far.
My apologies if this question is some classical problem and it takes me some time to define it, algorithmic design isn't my strong suit.
Defining the problem
Suppose I have a sequence A = {a, b, c, d, e}. Let's define f(x,y) to be a binary function which returns 0 if x < y and 1 if y <= x, by applying some complex sorting criteria.
Under normal conditions, this would provide enough detail for us to sort A. However, f can also return -1 if the sorting criterion is not well-defined for that particular pair of inputs. The undefined-ness of a pair of inputs is symmetric, i.e. f(q,r) is undefined if and only if f(r,q) is undefined.
I want to sort the sequence A, if possible, using the sorting criteria that are well defined.
For instance let's suppose that
f(a,d) = f(d,a) is undefined.
All other input pairs to f are well defined.
Then despite not knowing the inequality relation between a and d, we will be able to sort A based on the well-defined sorting criteria as long as a and d are not adjacent to one another in the resulting "sorted" sequence.
For instance, suppose we first determined the relative sorting of A - {d} to be {c, a, b, e}, as all of those pairs to f are well-defined. This could invoke any sorting algorithm, really.
Then we might call f(d,c), and
if d < c we are done - the sorted sequence is indeed {d, c, a, b, e}.
Else, we move to the next element in the sequence, and try to call f(a, d). This is undefined, so we cannot establish d's position from this angle.
We then call f(d, e), and move from right to left element-wise.
If we find some element x where d > x, we are done.
If we end up back at comparing f(a, d) once again, we have established that we cannot sort our sequence based on the well-defined sorting criterion we have.
The question
Is there a classification for these kinds of sorting algorithms, which handle undefined comparison pairs?
Better yet, although not expected: is there a well-known "efficient" approach? I have defined my own extremely rudimentary brute-force algorithm which solves this problem, but I am certain it is not ideal.
It effectively just throws out all sequence elements which cannot be compared when encountered, and sorts the remaining subsequence if any elements remain, before exhaustively attempting to place all of the sequence elements which are not comparable to all other elements into the sorted subsequence.
Simply a path on which to do further research into this topic would be great - I lack experience with algorithms and consequently have struggled to find out where I should be looking for some more background on these sorts of problems.
This is very close to topological sorting, with your binary relation providing the edges. In particular, you are just extending a partial order into a total order. Topological sort runs in O(V + E), so naively considering all pairs gives an O(n + p) algorithm, with n the number of elements and p the number of comparable pairs; since p can be as large as n^2, that is O(n^2) in the worst case.
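A minimal sketch of that idea (Python, Kahn's algorithm over only the comparable pairs; f is the comparison from the question, returning 0, 1 or -1). Any order it produces is consistent with every defined comparison; when several orders are possible it simply picks one, and it reports an error if the defined comparisons contain a cycle.

from collections import defaultdict, deque

def sort_with_partial_comparisons(items, f):
    # Build an edge x -> y for every defined comparison saying x < y;
    # undefined pairs simply contribute no constraint.
    successors = defaultdict(set)
    indegree = {x: 0 for x in items}
    for i, x in enumerate(items):
        for y in items[i + 1:]:
            result = f(x, y)
            if result == -1:
                continue                            # comparison undefined
            lo, hi = (x, y) if result == 0 else (y, x)
            if hi not in successors[lo]:
                successors[lo].add(hi)
                indegree[hi] += 1
    queue = deque(x for x in items if indegree[x] == 0)
    order = []
    while queue:
        x = queue.popleft()
        order.append(x)
        for y in successors[x]:
            indegree[y] -= 1
            if indegree[y] == 0:
                queue.append(y)
    if len(order) != len(items):
        raise ValueError("the defined comparisons contain a cycle")
    return order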

What algorithm can I use to verify that a list of nodes can be connected, given some constraints?

I'm building a game where the player is given a random set of nodes and attempts to build the longest list they can by placing the nodes in a certain order. Each node has zero or more connections on the sides that have to match with at least one connection on the side of the next node in the list. For example, a node might look like this:
                   +------+
left connections   | A  B |   right connections
                   | B  C |
                   +------+
The above node (example node) could be connected with any of these nodes:
+------+
| C    |   This node can connect to the right side of the example node (matches C)
| D    |
+------+
+------+
| B  K |   This node can connect to the left side of the example node (matches A)
| L  A |   This node can connect to the right side of the example node (matches B)
+------+
So, given those three nodes, the player could match them up in a list like so:
+------+      +------+      +------+
| B  K |      | A  B |      | C    |
| L  A | -A-  | B  C | -C-  | D    |
+------+      +------+      +------+
I need to validate the player's choices. The player doesn't have to select the nodes in the correct order at first, but their resulting final selection must be able to connect into a contiguous, linear list.
So, given an array of unordered nodes (the players selection), I need to form the nodes into a valid list like above, or show an error to the player.
I can brute force the validation, but I was hoping to find a more elegant solution.
After some hashing and precalculation the problem may look like this:
Given a graph, determine whether it has a path traversing all nodes
which is exactly the Hamiltonian path problem. You may read the research on this topic or analyze the particular structure of your graph (some special graphs have simple solutions), but in the general case the best solutions I know of are exponential.
However, the straightforward brute-force solution is to go through all permutations (n!) and check whether each one forms a proper path (another factor of n). This approach has O(n*n!) asymptotic complexity. Effectively it means that n should be at most about 12 for a sub-second check, and for n = 15 it will take several hours in the worst case.
A slight optimisation - forming the path gradually and checking at every new vertex - results in O(n!) time in the worst case, so it becomes possible to check n = 13 in a couple of seconds in the worst case, and even faster on average, as a lot of false paths are cut off at early stages.
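A sketch of that incremental search (Python; has_connection(a, b) is an assumed predicate saying whether node a may be directly followed by node b):

def exists_path_backtracking(nodes, has_connection):
    # Extend a partial path one node at a time and backtrack as soon as the
    # next candidate does not connect, so most of the n! orderings are never built.
    n = len(nodes)
    used = [False] * n

    def extend(last, placed):
        if placed == n:
            return True
        for i in range(n):
            if not used[i] and (last is None or has_connection(nodes[last], nodes[i])):
                used[i] = True
                if extend(i, placed + 1):
                    return True
                used[i] = False
        return False

    return extend(None, 0)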
You may go further and take advantage of dynamic programming. Let us define isProperPath[S][i] (where S is a bitmask of a subset of nodes and i is some node of this subset) as a boolean telling whether there exists a path built from exactly the nodes of the subset corresponding to S that ends at node i. Then it is easy to compute isProperPath[S][i] from the values for subsets with fewer elements than S:
isProperPath[S][i] = false;
for (j in S\i) {                              // try every possible predecessor j of i
    if (isProperPath[S\i][j] && hasConnection(j, i))
        isProperPath[S][i] = true;
}
By traversing all pairs of S and i in order of ascending size of S we compute all the values. The answer is true if and only if isProperPath[A][i] = true, where A is the whole given set and i is any of the nodes.
Time complexity is O(2^n * n^2), space complexity is O(2^n * n), as there are 2^n * n values and it takes O(n) to compute each value from previously computed ones. This algorithm makes it possible to check sets of size up to 24 in sub-second time, using about 400M bits, i.e. roughly 50 MB.
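For reference, here is a direct (unoptimised) Python transcription of the same DP; has_connection(j, i) plays the role of hasConnection(j, i) above, and a production version would pack the inner boolean rows into bitmasks to reach the memory figures just mentioned.

def exists_path_bitmask_dp(n, has_connection):
    # proper[S][i] is True iff some path uses exactly the nodes in bitmask S
    # and ends at node i; subsets are processed in increasing numeric order,
    # which guarantees that S \ {i} has already been filled in.
    proper = [[False] * n for _ in range(1 << n)]
    for i in range(n):
        proper[1 << i][i] = True                    # single-node paths
    for S in range(1 << n):
        for i in range(n):
            if not (S & (1 << i)) or proper[S][i]:
                continue
            prev = S ^ (1 << i)                     # the subset S \ {i}
            for j in range(n):
                if (prev >> j) & 1 and proper[prev][j] and has_connection(j, i):
                    proper[S][i] = True
                    break
    full = (1 << n) - 1
    return any(proper[full])                        # a path ending anywhere will do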

How to find all possible pairs from three subsets of a set with constraints in Erlang?

I have a set M which consists of three subsets A,B and C.
Problem: I would like to calculate all possible subsets S(1)...S(N) of M which contain all possible pairs between elements of A, B and C in such a manner that:
elements of A and B can appear in a pair only once for each of the two positions in a pair (that is, {a1,a2} and {b1,a1} can be in one subset S, but no more pairs {a1,_} or {_,a1} are allowed in this subset S);
elements of C can appear 1..N times in a subset S (that is, {a,c}, {b,c}, {x,c} can all appear in one subset S), but I would like to get subsets S for all possible numbers of occurrences of elements of C in a subset S.
For example, if we have A = [a1,a2], B = [b1,b2], C = [c1,c2], then some of the resulting subsets S would be (remember, they should contain pairs of elements):
- {a1,b1}, {b1,a2}, {a2,b2}, {b2,c1};
- {a1,b1}, {b1,a2}, {a2,b2}, {b2,c1}, {c1,c2};
- {a1,c1}, {c1,a2}, {c1,b2}, {b1,c1};
- etc.
I tend to think that first I need to find all possible subsets of M, which contain only one element of A, one element of B and 1..N elements of C (1). And after that I should somehow generate sets of pairs (2) from that. But I am not sure that this is the right strategy.
So, the more elaborated question would be:
what is the best way to create sets and find subsets in Erlang if the elements of the set M are integers?
are there any ready-made tools to find subsets of a set in Erlang?
are there any ready-made tools to generate all possible pairs of elements of a set in Erlang?
How can I solve the aforementioned problem in Erlang?
There is a sets module*, but I suspect you're better off thinking up an algorithm first -- its implementation in Erlang is the problem (or not) that comes after this. (Maybe you'll notice it's actually a graph algorithm, something like bipartite matching, and you'll be happy with Erlang's digraph module.)
Long story short, when you come up with an algorithm, Erlang can very probably be used to implement it. Yes, there is a certain amount of support for sets. But solutions to a problem requiring "all possible subsets" tend to be exponential (i.e., given n elements, there are 2^n subsets: every element is either in a given subset or not) and thus bad.
(* there are some modules concerning sets)

Comparison Based Ranking Algorithm (Variation)

This question is a variation on a previous question:
Comparison-based ranking algorithm
The variation I would like to pose is: what if loops are solved by discarding the earliest contradicting choices so that a transitive algorithm could actually be used.
Here I have pasted the original question:
"I would like to rank or sort a collection of items (with size potentially greater than 100,000) where each item in the collection does not have an intrinsic (comparable) value, instead all I have is the comparisons between any two items which have been provided by users in a 'subjective' manner.
Example:
Consider a collection with elements [a, b, c, d]. And comparisons by users:
b > a, a > d, d > c
The correct order of this collection would be [b, a, d, c].
This example is simple however there could be more complicated cases:
Since the comparisons are subjective, a user could also say that c > b, in which case that would conflict with the ordering above. Also, you may not have comparisons that 'connect' all the items, i.e.:
b > a, d > c. In which case the ordering is ambiguous: it could be [b, a, d, c] or [d, c, b, a]. In this case either ordering is acceptable.
...
The Question is:
Is there an algorithm which already exists that can solve the problem above, I would not like to spend effort trying to come up with one if that is the case. If there is no specific algorithm, is there perhaps certain types of algorithms or techniques which you can point me to?"
The simpler version where no "cycle" exists can be dealt with using topological sorting.
Now, for the more complex scenario, if for every "cycle" the order on which the elements appear in the final ranking does not matter, then you could try the following:
model the problem as a directed graph (i.e. the fact that a > b implies that there is an edge in the resulting graph starting in node "a" and ending in node "b").
calculate the strongly connected components (SCC) of the graph. In short, an SCC is a set of nodes with the property that you can get to any node in the set from any node in the set by following a list of edges (this corresponds to your "cycles" in the original problem).
transform the graph by "collapsing" each node into the SCC it belongs to, but preserve the edges that go between different SCCs.
it turns out the new graph obtained in this way is a directed acyclic graph, so we can perform a topological sort on it.
Finally, we're ready. The topological sort tells you the right order in which to print nodes from different SCCs. For nodes within the same SCC, no matter what order you choose, there will always be "cycles", so one possibility is to print them in a random order.
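A compact sketch of that pipeline (Python, assuming the networkx library is available; any SCC and topological-sort implementation would work the same way):

import networkx as nx

def rank_with_cycles(comparisons):
    # `comparisons` is an iterable of (a, b) pairs meaning "a > b";
    # each pair becomes a directed edge a -> b.
    graph = nx.DiGraph(comparisons)
    condensed = nx.condensation(graph)        # collapse every SCC ("cycle") into one node
    ranking = []
    for scc in nx.topological_sort(condensed):
        members = condensed.nodes[scc]["members"]
        ranking.extend(members)               # order inside an SCC is arbitrary
    return ranking

For the cycle-free example in the quoted question, rank_with_cycles([('b', 'a'), ('a', 'd'), ('d', 'c')]) should produce ['b', 'a', 'd', 'c'].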
