Sorting sequences where the binary sorting function's return value is undefined for some pairs - algorithm

I'm doing some comp. mathematics work where I'm trying to sort a sequence with a complex mathematical sorting predicate, which isn't always defined between two elements in the sequence. I'm trying to learn more about sorting algorithms that gracefully handle element-wise comparisons that cannot be made, as I've only managed a very rudimentary approach so far.
My apologies if this question is some classical problem and it takes me some time to define it, algorithmic design isn't my strong suit.
Defining the problem
Suppose I have a sequence A = {a, b, c, d, e}. Let's define f(x,y) to be a binary function which returns 0 if x < y and 1 if y <= x, by applying some complex sorting criteria.
Under normal conditions, this would provide enough detail for us to sort A. However, f can also return -1 if the sorting criterion is not well-defined for that particular pair of inputs. The undefined-ness of a pair of inputs is symmetric, i.e. f(q,r) is undefined if and only if f(r,q) is undefined.
I want to sort the sequence A, if possible, using the sorting criteria that are well defined.
For instance let's suppose that
f(a,d) = f(d,a) is undefined.
All other input pairs to f are well defined.
Then despite not knowing the inequality relation between a and d, we will be able to sort A based on the well-defined sorting criteria as long as a and d are not adjacent to one another in the resulting "sorted" sequence.
For instance, suppose we first determined the relative sorting of A - {d} to be {c, a, b, e}, as all of those pairs to f are well-defined. This could invoke any sorting algorithm, really.
Then we might call f(d,c), and
if d < c we are done - the sorted sequence is indeed {d, c, a, b, e}.
Else, we move to the next element in the sequence, and try to call f(a, d). This is undefined, so we cannot establish d's position from this angle.
We then call f(d, e), and move from right to left element-wise.
If we find some element x where d > x, we are done.
If we end up back at comparing f(a, d) once again, we have established that we cannot sort our sequence based on the well-defined sorting criterion we have.
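The placement scan described above can be sketched in Python. This is a slight generalization of the left-then-right scan: it tries every insertion position for the troublesome element and accepts the first one consistent with the defined comparisons. The f, values, and undefined names below are hypothetical stand-ins for the actual sorting criteria.

```python
def try_insert(sorted_seq, x, f):
    """Try every insertion position for x in an already-sorted list,
    accepting the first one where the defined comparisons with both
    neighbours hold.  f returns 0 for "less than", 1 for "greater or
    equal", and -1 for "undefined"."""
    n = len(sorted_seq)
    for i in range(n + 1):
        # x must compare as smaller than its right neighbour...
        if i < n and f(x, sorted_seq[i]) != 0:
            continue
        # ...and as larger-or-equal than its left neighbour.
        if i > 0 and f(sorted_seq[i - 1], x) != 0:
            continue
        return sorted_seq[:i] + [x] + sorted_seq[i:]
    return None  # no position is consistent with the defined comparisons
```

With the example above (d undefined against a, but d < c defined), try_insert(['c', 'a', 'b', 'e'], 'd', f) places d at the front, giving {d, c, a, b, e}.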
The question
Is there a classification for these kinds of sorting algorithms, which handle undefined comparison pairs?
Better yet although not expected, is there a well-known "efficient" approach? I have defined my own extremely rudimentary brute-force algorithm which solves this problem, but I am certain it is not ideal.
It effectively just throws out all sequence elements which cannot be compared when encountered, sorts the remaining subsequence if any elements remain, and then exhaustively attempts to place each of the discarded, not-fully-comparable elements into the sorted subsequence.
Simply a path on which to do further research into this topic would be great - I lack experience with algorithms and consequently have struggled to find out where I should be looking for some more background on these sorts of problems.

This is very close to topological sorting, with each defined comparison giving an edge. In particular, this is just extending a partial order into a total order. Topological sort runs in O(V+E), so naively considering all pairs gives a worst-case O(n^2) algorithm (more precisely O(n+p), with n the number of elements and p the number of comparable pairs).
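A minimal sketch of this idea in Python, assuming f follows the 0/1/-1 convention from the question. It is Kahn's algorithm, reporting None whenever more than one element could come next, i.e. whenever the defined comparisons do not pin down a unique order:

```python
from collections import defaultdict

def sort_if_determined(items, f):
    """Topologically sort `items` using only the defined comparisons.
    f(x, y) returns 0 if x < y, 1 if y <= x, and -1 if undefined.
    Returns the unique total order, or None if the defined comparisons
    leave the order ambiguous."""
    succ = defaultdict(set)
    indeg = {x: 0 for x in items}
    for i, x in enumerate(items):
        for y in items[i + 1:]:
            r = f(x, y)
            if r == -1:
                continue  # undefined pair: simply no edge
            lo, hi = (x, y) if r == 0 else (y, x)
            if hi not in succ[lo]:
                succ[lo].add(hi)
                indeg[hi] += 1
    order = []
    ready = [x for x in items if indeg[x] == 0]
    while ready:
        if len(ready) > 1:
            return None  # two candidates for the next slot: ambiguous
        x = ready.pop()
        order.append(x)
        for y in succ[x]:
            indeg[y] -= 1
            if indeg[y] == 0:
                ready.append(y)
    return order if len(order) == len(items) else None
```

Note that rejecting every ambiguity is stricter than the question requires; you could instead pick any ready element to obtain one valid linear extension of the partial order.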

Related

Sorting given pairwise orderings

I have n variables (Var 1 ... Var n) and do not know their exact values. The n-choose-2 pairwise orderings between these n variables are known. For instance, it is known that Var 5 <= Var 9, Var 9 <= Var 10, and so on for all pairs. Further, it is also known that these pairwise orderings are consistent and do not lead to a degenerate case of equality throughout. That is, in the above example the inequality Var 10 <= Var 5 will not be present.
What is the most efficient sorting algorithm for such problems which gives a sorting of all variables?
Pairwise ordering is the only thing that any (comparison-based) sort needs anyway, so your question boils down to "what's the most efficient comparison-based sorting algorithm".
In answer to that, I recommend you look into Quicksort, Heapsort, Timsort, possibly Mergesort and see what will work well for your case in terms of memory requirements, programming complexity etc.
I find Quicksort the quickest to implement for a once-off program.
The question is not so much how to sort (use the standard sort of your language) but how to feed the sort criterion to the sorting algorithm.
In most languages you need to provide an int comparison(T a, T b), where T is the type of elements, that returns -1, 0 or 1 depending on which is larger.
So you need a fast access to the data structure storing (all) pairwise orderings, given a pair of elements.
So the question is not so much whether Var 10 <= Var 5 will be present (inconsistent), but whether Var 5 <= Var 10 is ensured to be present. If so, you can test presence of the constraint in O(1) with a hash set of pairs of elements; otherwise, you need to find a transitive relationship between a and b, which might not even exist (it's unclear from the OP whether we are talking about a partial or a total order, i.e. whether for all a, b we ensure a < b, b < a or a = b).
With roughly worst case N^2 entries, this hash is pretty big. Building it still requires exploring transitive links which is costly.
Following links probably means a map of elements to sets of (immediately) smaller elements: when comparing a to b, if map(a) contains b or map(b) contains a, you can answer immediately; otherwise you need to recurse on the elements of map(a) and map(b), with pretty bad complexity. Ultimately you'll still be accumulating sets of smaller values to build your test.
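The recursive lookup described here can be sketched as a reachability search. The smaller map below is the hypothetical element-to-immediately-smaller-elements structure mentioned above:

```python
def precedes(a, b, smaller):
    """Is a transitively smaller than b?  `smaller` maps each element
    to the set of elements known to be immediately smaller than it.
    Worst case this walks the whole reachable set below b."""
    seen, stack = set(), [b]
    while stack:
        x = stack.pop()
        for y in smaller.get(x, ()):
            if y == a:
                return True
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return False
```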
Perhaps if you have a low number of constraints a <= b, just swapping a and b when they do not respect a constraint and iterating over the constraints to a fixpoint (a full round with all constraints applied and no effect) could be more efficient. At least it's O(1) in memory.
A variant of that could be sorting using a stable sort (preserves order of incomparable entries) several times with subsets of the constraints.
Last idea: computing a max with your input data is O(number of constraints), so you could just repeatedly compute the max, add it at the end of the target, remove the constraints that use it, rinse and repeat. I'd use a stack storing the largest element up to a given constraint index, so you can backtrack to that rather than restart from scratch.
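A sketch of the repeated-max idea, assuming (as in the question) that every unordered pair appears exactly once as a (lo, hi) constraint with lo <= hi; the stack-based backtracking optimisation is omitted for brevity, so this version is quadratic:

```python
def sort_by_repeated_max(items, pairs):
    """pairs: (lo, hi) tuples meaning lo <= hi, one per unordered pair.
    Repeatedly find the element that is never on the smaller side of a
    remaining pair (i.e. the max), peel it off, and repeat."""
    remaining = set(items)
    out = []
    while remaining:
        smaller_side = {lo for lo, hi in pairs
                        if lo in remaining and hi in remaining and lo != hi}
        m = next(x for x in remaining if x not in smaller_side)
        out.append(m)
        remaining.discard(m)
    return out[::-1]  # collected largest-first, so reverse for ascending
```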

How to find the subset with the greatest number of items in common?

Let's say I have a number of 'known' sets:
1 {a, b, c, d, e}
2 {b, c, d, e}
3 {a, c, d}
4 {c, d}
I'd like a function which takes a set as an input (for example {a, c, d, e}) and finds the known set with the most elements that has no items outside the input; in other words, the subset of the input with the greatest cardinality. The answer doesn't have to be a proper subset. The answer in this case would be {a, c, d}.
EDIT: the above example was wrong, now fixed.
I'm trying to find the absolute most efficient way of doing this.
(In the below, I am assuming that the cost of comparing two sets is O(1) for the sake of simplicity. That operation is outside my control so there's no point thinking about it. In truth it would be a function of the cardinality of the two sets being compared.)
Candidate 1:
Generate all subsets of the input, then iterate over the known sets and return the largest one that is a subset. The downside to this is that the complexity will be something like O(2^n × m), where n is the cardinality of the input set and m is the number of 'known' sets.
Candidate 1a (thanks #bratbrat):
Iterate over all 'known' sets, calculate the cardinality of the intersection, and take the one with the highest value. This would be O(n), where n is the number of known sets.
Candidate 2:
Create an inverse table and calculate the Euclidean distance between the input and the known sets. This could be quite quick. I'm not clear how I could limit this to include only subsets without a subsequent O(n) filter.
Candidate 3:
Iterate over all known sets and compare against the input. The complexity would be O(n) where n is the number of known sets.
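For concreteness, Candidate 3 is a single linear scan in Python (largest_known_subset is an illustrative name, not an existing function):

```python
def largest_known_subset(known_sets, query):
    """One pass over the known sets: keep the largest one that is a
    subset of the query set, or None if no known set qualifies."""
    best = None
    for s in known_sets:
        # `<=` on Python sets is the subset test
        if s <= query and (best is None or len(s) > len(best)):
            best = s
    return best
```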
I have at my disposal the set functions built into Python and Redis.
None of these seems particularly great. Ideas? The number of sets may get large (around 100,000 at a guess).
There's no possible way to do this in less than O(n) time... just reading the input is O(n).
A couple ideas:
Sort the sets by size (biggest first), and search for the first set which is a subset of the input set. Once you find one, you don't have to examine the rest.
If the number of possible items which could be in the sets is limited, you could represent them by bit-vectors. Then you could calculate a lookup table to tell you whether a given set is a subset of the input set. (Walk down the bits for each input set under consideration, word by word, indexing each word into the appropriate table. If you find an entry telling you that it's not a subset, again, you can move on directly to the next input set.) Whether this would actually buy you performance, depends on the implementation language. I imagine it would be most effective in a language with primitive integral types, like C or Java.
Take the union of the known sets. This becomes a dictionary of known elements.
Sort the known elements by their value (they're integers, right). This defines a given integer's position in a bit string.
Use the above to define bit strings for each of the known sets. This is a one time operation - the results should be stored to avoid recomputation.
For an input set, run it through the same transform to obtain its bit string.
To get the largest subset, run through the list of known bit strings, taking the intersection (logical and) with the input bit string. Count the '1' elements. Remember the largest one.
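The steps above can be sketched with plain Python integers as bitmasks (no external library needed; function names are illustrative):

```python
def build_bit_index(known_sets):
    """One-time preprocessing: assign each known element a bit position
    and encode every known set as an integer bitmask."""
    universe = sorted(set().union(*known_sets))
    pos = {e: i for i, e in enumerate(universe)}

    def to_bits(s):
        m = 0
        for e in s:
            if e in pos:  # elements outside the universe cannot occur
                m |= 1 << pos[e]  # in any known set, so ignore them
        return m

    return [(to_bits(s), s) for s in known_sets], to_bits

def largest_subset_bits(index, to_bits, query):
    """Scan the precomputed index for the largest known set whose bits
    survive the AND with the query intact (i.e. that is a subset)."""
    q = to_bits(query)
    best = None
    for bits, s in index:
        if bits & q == bits and (best is None or len(s) > len(best)):
            best = s
    return best
```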
http://packages.python.org/bitstring
As mentioned in the comments, this can be parallelized by subdividing the known sets and giving each thread its own subset to work on. Each thread serves up its best match and then the parent thread picks the best from the threads.
How many searches are you making? In case you are searching multiple input sets you should be able to pre-process all the known sets (perhaps as a tree structure) and your search time for each query would be in the order of your query set size.
Eg: Create a Trie structure with all the known sets. Make sure to sort each set before inserting them. For the query, follow the links that are in the set.

How to find all possible pairs from three subsets of a set with constraints in Erlang?

I have a set M which consists of three subsets A,B and C.
Problem: I would like to calculate all possible subsets S(1)...S(N) of M which contain all possible pairs between elements of A, B and C in such manner that:
elements of A and B can happen in a pair only once for each of two positions in a pair (that is {a1,a2} and {b1,a1} can be in one subset S, but no more elements {a1,_} and {_,a1} are allowed in this subset S);
elements of C can happen 1-N times in a subset S (that is {a,c}, {b,c}, {x,c} can happen in one subset S), but I would like to get subsets S for all possible numbers of elements of C in a subset S.
For example, if we have A = [a1,a2], B = [b1,b2], C = [c1,c2], then some of the resulting subsets S would be (remember, they should contain pairs of elements):
- {a1,b1}, {b1,a2}, {a2,b2}, {b2,c1};
- {a1,b1}, {b1,a2}, {a2,b2}, {b2,c1}, {c1,c2};
- {a1,c1}, {c1,a2}, {c1,b2}, {b1,c1};
- etc.
I tend to think that first I need to find all possible subsets of M, which contain only one element of A, one element of B and 1..N elements of C (1). And after that I should somehow generate sets of pairs (2) from that. But I am not sure that this is the right strategy.
So, the more elaborated question would be:
what is the best way to create sets and find subsets in Erlang if the elements of the set M are integers?
are there any ready-made tools to find subsets of a set in Erlang?
are there any ready-made tools to generate all possible pairs of elements of a set in Erlang?
How can I solve the aforementioned problem in Erlang?
There is a sets module*, but I suspect you're better off thinking up an algorithm first -- its implementation in Erlang is the problem (or not) that comes after this. (Maybe you'll notice it's actually a graph algorithm (like bipartite matching or something), and you'll be happy with Erlang's digraph module.)
Long story short, when you come up with an algorithm, Erlang can very probably be used to implement it. Yes, there is a certain support for sets. But solutions to a problem requiring "all possible subsets" tend to be exponential (i.e., given n elements, there are 2^n subsets; for every element you either have it in your subset or not) and thus bad.
(* there are some modules concerning sets)

Divide and conquer on sorted input with Haskell

For a part of a divide and conquer algorithm, I have the following question where the data structure is not fixed, so set is not to be taken literally:
Given a set X sorted wrt. some ordering of elements and subsets A and B together consisting of all elements in X, can sorted versions A' and B' of A and B be constructed in time linear in the number of elements in X ?
At the moment I am doing a standard sort at each recursive step giving the recursion
T(n) = 2*T(n/2) + O(n*log n)
for the complexity rather than
T(n) = 2*T(n/2) + O(n)
like in the procedural version, where one can utilize a structure with constant-time lookup on A and B to form A' and B' in linear time.
The added log n factor carries over to the overall complexity, giving O(n* (log n)^2) instead of O(n* log n).
EDIT:
Perhaps I am understanding the term lookup incorrectly. The creation of A' and B' in linear time is easy to do if membership of A and B can be checked in constant time.
I didn't succeed in my attempt at making things clearer by abstracting away the specifics, so here is the actual problem:
I am implementing the algorithm for the closest pair problem. Given a finite collection P of points in the plane, it finds a pair of points in P with the minimal distance. It works roughly as follows:
If P has at least 4 points, form Px and Py, the points in P sorted by x- and y-coordinate. By splitting Px, form L and R, the left- and right-most halves of points. Recursively compute the closest pair distance in L and R, and let d be the minimum of the two. Now the minimum distance in P is either d or the distance from a point in L to a point in R. If the minimal distance is between points from separate halves, it will appear between a pair of points lying in the strip of width 2*d centered around the line x = x0, where x0 is the x-coordinate of a right-most point in L. It turns out that to find a potential minimal-distance pair in the strip, it is enough to compute, for every point in the strip, its distance to the seven following points, provided the strip points are in a collection sorted by y-coordinate.
It is in the steps of forming the sorted collections to pass into the recursion and sorting the strip points by y-coordinate that I don't see how to, in Haskell, utilize having sorted P at the beginning of the recursion.
The following function may interest you:
partition :: (a -> Bool) -> [a] -> ([a], [a])
partition f xs = (filter f xs, filter (not . f) xs)
If you can compute set-membership in constant time, that is, there is a predicate of type a -> Bool that runs in constant time, then partition will run in time linear in the length of its input list. Furthermore, partition is stable, so that if its input list is sorted, then so are both output lists.
I would also like to point out that the above definition is meant to give the semantics of partition only; the real implementation in GHC only walks its input list once, even if the entire output is forced.
Of course, the real crux of the question is providing a constant-time predicate. The way you phrased the question leaves sets A and B quite unstructured -- you demand that we can handle any particular partitioning. In that case, I don't know of any particularly Haskell-y way of doing constant-time lookup in arbitrary sets. However, often these problems are a bit more structured: often, rather than set-membership, you are actually interested in whether some easily-computable property holds or not. In this case, the above is just what the doctor ordered.
I know very very little about Haskell but here's a shot anyway.
Given that (A+B) == X, can't you just iterate through X (in the sorted order) and add each element to A' or B' depending on whether it exists in A or B? Given constant-time lookup of an element x in the sets A and B, that would be linear.
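A sketch of this idea in Python for concreteness (the surrounding answer isn't Haskell-specific), using a set for the constant-time membership test; names are illustrative:

```python
def split_sorted(xs, in_a):
    """Walk X once in sorted order, routing each element to A' or B'.
    `in_a` must support constant-time membership tests (e.g. a Python
    set), which makes the whole split linear in len(xs).  Both outputs
    inherit the input's ordering, since elements are never reordered."""
    a_sorted, b_sorted = [], []
    for x in xs:
        (a_sorted if x in in_a else b_sorted).append(x)
    return a_sorted, b_sorted
```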

Comparison Based Ranking Algorithm (Variation)

This question is a variation on a previous question:
Comparison-based ranking algorithm
The variation I would like to pose is: what if cycles are resolved by discarding the earliest contradicting choices, so that a transitive algorithm can actually be used?
Here I have pasted the original question:
"I would like to rank or sort a collection of items (with size potentially greater than 100,000) where each item in the collection does not have an intrinsic (comparable) value, instead all I have is the comparisons between any two items which have been provided by users in a 'subjective' manner.
Example:
Consider a collection with elements [a, b, c, d]. And comparisons by users:
b > a, a > d, d > c
The correct order of this collection would be [b, a, d, c].
This example is simple however there could be more complicated cases:
Since the comparisons are subjective, a user could also say that c > b. In which case that would cause a conflict with the ordering above. Also you may not have comparisons that 'connects' all the items, ie:
b > a, d > c. In which case the ordering is ambiguous. It could be : [b, a, d, c] or [d, c, b, a]. In this case either ordering is acceptable.
...
The Question is:
Is there an algorithm which already exists that can solve the problem above, I would not like to spend effort trying to come up with one if that is the case. If there is no specific algorithm, is there perhaps certain types of algorithms or techniques which you can point me to?"
The simpler version where no "cycle" exists can be dealt with using topological sorting.
Now, for the more complex scenario, if for every "cycle" the order on which the elements appear in the final ranking does not matter, then you could try the following:
model the problem as a directed graph (i.e. the fact that a > b implies that there is an edge in the resulting graph starting in node "a" and ending in node "b").
calculate the strongly connected components (SCC) of the graph. In short, an SCC is a set of nodes with the property that you can get to any node in the set from any node in the set by following a list of edges (this corresponds to your "cycles" in the original problem).
transform the graph by "collapsing" each node into the SCC it belongs to, but preserve the edges that go between different SCCs.
it turns out the new graph obtained in the way mentioned above is a directed acyclic graph, so we can perform a topological sort on it.
Finally, we're ready. The topological sort should tell you the right order in which to print nodes in different SCC's. For the nodes in the same SCC's, no matter what the order you choose is, there will always be "cycles", so a possibility might be printing them in a random order.
