Comparison Based Ranking Algorithm (Variation)

This question is a variation on a previous question:
Comparison-based ranking algorithm
The variation I would like to pose is: what if loops are solved by discarding the earliest contradicting choices so that a transitive algorithm could actually be used.
Here I have pasted the original question:
"I would like to rank or sort a collection of items (with size potentially greater than 100,000) where each item in the collection does not have an intrinsic (comparable) value, instead all I have is the comparisons between any two items which have been provided by users in a 'subjective' manner.
Example:
Consider a collection with elements [a, b, c, d]. And comparisons by users:
b > a, a > d, d > c
The correct order of this collection would be [b, a, d, c].
This example is simple however there could be more complicated cases:
Since the comparisons are subjective, a user could also say that c > b, which would conflict with the ordering above. You also may not have comparisons that 'connect' all the items, i.e.:
b > a, d > c. In which case the ordering is ambiguous. It could be : [b, a, d, c] or [d, c, b, a]. In this case either ordering is acceptable.
...
The Question is:
Is there an algorithm which already exists that can solve the problem above, I would not like to spend effort trying to come up with one if that is the case. If there is no specific algorithm, is there perhaps certain types of algorithms or techniques which you can point me to?"

The simpler version where no "cycle" exists can be dealt with using topological sorting.
Now, for the more complex scenario, if the order in which the elements of each "cycle" appear in the final ranking does not matter, then you could try the following:
model the problem as a directed graph (i.e. the fact that a > b implies that there is an edge in the resulting graph starting at node "a" and ending at node "b").
calculate the strongly connected components (SCCs) of the graph. In short, an SCC is a set of nodes with the property that you can get from any node in the set to any other node in the set by following a list of edges (this corresponds to your "cycles" in the original problem).
transform the graph by "collapsing" each node into the SCC it belongs to, but preserve the edges that go between different SCCs.
it turns out the new graph obtained in the way mentioned above is a directed acyclic graph, so we can perform a topological sort on it.
Finally, we're ready. The topological sort should tell you the right order in which to print nodes in different SCCs. For the nodes in the same SCC, no matter what order you choose, there will always be "cycles", so a possibility might be printing them in a random order.
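The steps above can be sketched in Python roughly as follows. This is only an illustration under some assumptions: comparisons are given as (winner, loser) pairs, every compared item also appears in items, and the function name rank_with_cycles is made up for this example. It uses Kosaraju's algorithm for the SCCs and Kahn's algorithm for the topological sort, both written iteratively so that large collections do not hit the recursion limit.

```python
from collections import defaultdict, deque

def rank_with_cycles(items, comparisons):
    """Return groups of items from best to worst; each group is one SCC."""
    graph, reverse = defaultdict(list), defaultdict(list)
    for winner, loser in comparisons:
        graph[winner].append(loser)      # edge winner -> loser
        reverse[loser].append(winner)

    # Pass 1 (Kosaraju): record nodes in order of DFS completion.
    visited, finish_order = set(), []
    for start in items:
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(graph[start]))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:  # all neighbours explored
                finish_order.append(node)
                stack.pop()

    # Pass 2: DFS on the reversed graph, latest finisher first, labels each SCC.
    component, count = {}, 0
    for start in reversed(finish_order):
        if start in component:
            continue
        component[start] = count
        stack = [start]
        while stack:
            node = stack.pop()
            for prev in reverse[node]:
                if prev not in component:
                    component[prev] = count
                    stack.append(prev)
        count += 1

    # Collapse each SCC into one node, keeping only edges between different SCCs.
    members = defaultdict(list)
    for item in items:
        members[component[item]].append(item)
    dag, indegree = defaultdict(set), [0] * count
    for winner, loser in comparisons:
        a, b = component[winner], component[loser]
        if a != b and b not in dag[a]:
            dag[a].add(b)
            indegree[b] += 1

    # Kahn's algorithm: topological sort of the condensed (acyclic) graph.
    queue = deque(c for c in range(count) if indegree[c] == 0)
    groups = []
    while queue:
        c = queue.popleft()
        groups.append(members[c])
        for nxt in dag[c]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    return groups
```

Each returned group holds the items of one SCC; a group with more than one item is a "cycle", and its members can be printed in any order.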

Related

Establishing chronology of a list based on a directed graph tree

So, in a personal project I've been working on I came across the following problem, and I've been struggling to come up with a solution since my maths skills are not terribly great.
Let's say you have the following tree of numbers a b c d e f g h:
      a
     / \
    b   c
   / |  |
  g  d  f
  |  |
  h  e
Each step down the tree means that the next number is bigger than the previous one. So a < b, d < e, a < c. However, it is impossible to determine whether b > c or b < c - we can only tell that both numbers are bigger than a.
Let's say we have an ordered list of numbers, for instance [a, b, c, d, e]. How do we write an algorithm that checks if the order of the numbers in the list (assuming that L[i] < L[i+1]) is, in fact, correct in relation to the information we have according to this tree?
I.e., both [a, c, b, d, e] and [a, b, d, c, e] are correct, but [c, a, b, d, e] is not (since we know that c > a but nothing else in relation to how the other numbers are structured).
For the sake of the algorithm, let's assume that our access to the tree is a function provably_greater(X, Y) which returns true if the tree knows that a number is higher than another number. I.e. provably_greater(a, d) = True, but provably_greater(d, f) = False. Naturally if a number is provably not greater, it also returns false.
This is not a homework question, I have abstracted the problem quite a lot to make it more clear, but solving this problem is quite crucial for what I'm trying to do. I've made several attempts at cracking it myself, but everything that I come up with ends up being insufficient for some edge case I find out about later.
Thanks in advance.
Your statement "everything that I come up with ends up being insufficient for some edge case I find out about later" makes it seem that you have no solution at all. Here is a brute-force algorithm that should work in all cases. I can think of several possible ways to improve the speed, but this is a start.
First, set up a data structure that allows a quick evaluation of provably_greater(X, Y) based on the tree. This structure can be a set or hash-table, which will take much memory but allow fast access. For each leaf of the tree, take a path up to the root. At each node, look at all descendant nodes and add an ordered pair to the set that shows the less-than relation for those two nodes. In your example tree, if you start at node h you move up to node g and add (g,h) to the set, then move up to node b and add the pairs (b,h) and (b,g) to the set, then move up to node a and add the pairs (a,h), (a,g), and (a,b) to the set. Do the same for leaf nodes e and f. The pair (a,b) will be added twice to the set, due to the leaf nodes h and e, but a set structure can handle this easily.
The function provably_greater(X, Y) is now quick and easy: the result is True if the pair (Y,X) is in the set and False otherwise.
You now look at all pairs of numbers in your list--for the list [a,b,c,d,e] you would look at the pairs (a,b), (a,c), (b,c), etc. If provably_greater(X, Y) is true for any of those pairs, the list is out of order. Otherwise, the list is in order.
This should be very easy to implement in a language like Python. Let me know if you would like some Python 3 code.
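A minimal Python 3 sketch of this brute-force approach could look like the following. The PARENT dictionary is a hypothetical encoding of the example tree (each node mapped to its parent), and the helper names are made up for illustration. Instead of starting only from the leaves it walks up from every node, which produces the same set of pairs. It reads provably_greater(X, Y) as "X is known to be greater than Y", as in the description above.

```python
from itertools import combinations

# Hypothetical encoding of the example tree: each node maps to its parent.
PARENT = {"b": "a", "c": "a", "g": "b", "d": "b", "f": "c", "h": "g", "e": "d"}

def build_less_than_pairs(parent):
    """Return the set of (smaller, larger) pairs implied by the tree."""
    pairs = set()
    for node in parent:
        ancestor = parent.get(node)
        while ancestor is not None:
            pairs.add((ancestor, node))   # every ancestor is smaller than node
            ancestor = parent.get(ancestor)
    return pairs

LESS_THAN = build_less_than_pairs(PARENT)

def provably_greater(x, y):
    """True if the tree shows that x is greater than y."""
    return (y, x) in LESS_THAN

def is_consistent(order):
    """True if no earlier element is provably greater than a later one."""
    # combinations() yields (earlier, later) pairs in list order.
    return not any(provably_greater(x, y) for x, y in combinations(order, 2))
```

The check looks at every pair, so it is quadratic in the length of the list.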
I'm going to ignore your provably_greater function and assume access to the tree so that I can provide an efficient algorithm.
First, perform an Euler tour of the tree, remembering the start and end indexes for every node. If you use the same tree to check a lot of lists, you only have to do this once. See https://www.geeksforgeeks.org/euler-tour-tree/
Create an initially empty binary search tree of indexes.
Iterate through the list. For each node, check to see if the search tree contains any indexes between its start and end Euler tour indexes. If it does, then the list is out of order. If it does not, then insert its start index into the search tree. This will prevent any provably lesser node from appearing later in the list.
That's it -- O(N log N) altogether, and for each list.
A TreeSet in Java or std::set in C++ can be used for the binary search tree.
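As a rough Python illustration of the same idea: Python has no built-in balanced search tree, so this sketch keeps the start indexes in a sorted list via the bisect module. Lookups are O(log N), but insort shifts list elements, so an insertion is O(N) in the worst case; a real balanced tree would be needed for the stated bound. The CHILDREN dictionary and the function names are assumptions made for the example.

```python
from bisect import bisect_right, insort

# Hypothetical encoding of the example tree: node -> list of children.
CHILDREN = {"a": ["b", "c"], "b": ["g", "d"], "c": ["f"], "g": ["h"], "d": ["e"]}

def euler_intervals(children, root):
    """Number nodes in pre-order; descendants of n have start in (start[n], end[n]]."""
    start, end, clock = {}, {}, 0
    stack = [(root, False)]
    while stack:
        node, done = stack.pop()
        if done:
            end[node] = clock          # largest start index inside the subtree
            continue
        clock += 1
        start[node] = clock
        stack.append((node, True))     # revisit after all children are numbered
        for child in reversed(children.get(node, [])):
            stack.append((child, False))
    return start, end

def is_consistent(order, start, end):
    """True if no node appears after one of its own descendants."""
    seen = []                          # sorted start indexes of nodes placed so far
    for node in order:
        # First stored index strictly greater than this node's own start index.
        i = bisect_right(seen, start[node])
        if i < len(seen) and seen[i] <= end[node]:
            return False               # a descendant (provably greater) came earlier
        insort(seen, start[node])
    return True
```

The intervals only need to be computed once per tree, for example start, end = euler_intervals(CHILDREN, "a"), and can then be reused for every list that is checked.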

Transitive closure in bidirected graph

I have a big structure with items and relations between the items.
I need to find all transitive relations for all items. I duplicate all links and use transitive closure. E.g.:
A --- B --- C    E --- F --- G
      |
      |
      D
As a result I need to get the pairs:
A-B, A-C, A-D, B-A, B-C, B-D, C-A, C-B, C-D, D-A, D-B, D-C,
E-F, E-G, F-E, F-G, G-E, G-F
To use transitive closure I have to supply the pairs [A-B, B-A, B-C, C-B, B-D, D-B, E-F, F-E, F-G, G-F].
This is a big problem for me because the dataset is very large.
The best way to solve my problem would be an algorithm that allows me to get all relations using only one-sided links (A-B, B-C, B-D, E-F, F-G).
Are there any algorithms to get all relations for each element of the graph without duplicating links?
You may model this as a graph problem and traverse the entire dataset using either DFS (depth-first search) or BFS (breadth-first search). During traversal, you may assign a component number to each tree in the forest of data you are investigating, and as a result, you can find all the connected components of this graph. Then for each connected component, you may simply form groups of 2 using its members, and use those to describe the relation. If there is an odd number of elements, you can pick an already used item and link it to the last remaining one.
This obviously assumes that your goal is to find the connected components alone, and not print the relations, as you put it, in a specific manner. For instance, if you were trying to print the links so that the maximum distance between the items would be as minimal as possible, the problem becomes much more complex.
Another approach, which shares the assumption I mentioned above, would be to use union-find, also known as the disjoint-set data structure. You start with N singleton sets, one for each of your N items. Then, as you traverse the relations, for each relation (x, y), you unite the sets which contain the items x and y. In the end, all the items of a connected component will be in the same set.
The first approach has O(V + E) time complexity, V and E being the number of items and relations in your data, respectively. The second approach has O(V + E · k(V)) time complexity, where k is the inverse Ackermann function, which grows extremely slowly (even more slowly than a logarithm).
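A small Python sketch of the union-find approach, using only the one-sided links, might look like this. The function names are invented for the example, and union by rank is left out for brevity, so this version does not reach the inverse Ackermann bound; it only shows the mechanics.

```python
from collections import defaultdict
from itertools import permutations

def connected_groups(links):
    """Group items into connected components from one-sided (x, y) links."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)            # unseen items start as their own set
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b        # unite the two sets

    groups = defaultdict(list)
    for item in parent:
        groups[find(item)].append(item)
    return list(groups.values())

def all_relations(links):
    """Every ordered pair of distinct items that share a component."""
    return [pair
            for group in connected_groups(links)
            for pair in permutations(group, 2)]
```

For the example in the question, all_relations([("A", "B"), ("B", "C"), ("B", "D"), ("E", "F"), ("F", "G")]) yields the 18 pairs listed there, in some order. Note that the number of pairs is quadratic in the component size, so for a very large dataset it may be better to store only the component of each item.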

Sorting sequences where the binary sorting function return is undefined for some pairs

I'm doing some comp. mathematics work where I'm trying to sort a sequence with a complex mathematical sorting predicate, which isn't always defined between two elements in the sequence. I'm trying to learn more about sorting algorithms that gracefully handle element-wise comparisons that cannot be made, as I've only managed a very rudimentary approach so far.
My apologies if this question is some classical problem and it takes me some time to define it, algorithmic design isn't my strong suit.
Defining the problem
Suppose I have a sequence A = {a, b, c, d, e}. Let's define f(x,y) to be a binary function which returns 0 if x < y and 1 if y <= x, by applying some complex sorting criteria.
Under normal conditions, this would provide enough detail for us to sort A. However, f can also return -1, if the sorting criteria is not well-defined for that particular pair of inputs. The undefined-ness of a pair of inputs is commutative, i.e. f(q,r) is undefined if and only if f(r,q) is undefined.
I want to try to sort the sequence A, if possible, using the sorting criteria that are well defined.
For instance let's suppose that
f(a,d) = f(d,a) is undefined.
All other input pairs to f are well defined.
Then despite not knowing the inequality relation between a and d, we will be able to sort A based on the well-defined sorting criteria as long as a and d are not adjacent to one another in the resulting "sorted" sequence.
For instance, suppose we first determined the relative sorting of A - {d} to be {c, a, b, e}, as all of those pairs to f are well-defined. This could invoke any sorting algorithm, really.
Then we might call f(d,c), and
if d < c we are done - the sorted sequence is indeed {d, c, a, b, e}.
Else, we move to the next element in the sequence, and try to call f(a, d). This is undefined, so we cannot establish d's position from this angle.
We then call f(d, e), and move from right to left element-wise.
If we find some element x where d > x, we are done.
If we end up back at comparing f(a, d) once again, we have established that we cannot sort our sequence based on the well-defined sorting criterion we have.
The question
Is there a classification for these kinds of sorting algorithms, which handle undefined comparison pairs?
Better yet although not expected, is there a well-known "efficient" approach? I have defined my own extremely rudimentary brute-force algorithm which solves this problem, but I am certain it is not ideal.
It effectively just throws out all sequence elements which cannot be compared when encountered, and sorts the remaining subsequence if any elements remain, before exhaustively attempting to place all of the sequence elements which are not comparable to all other elements into the sorted subsequence.
Simply a path on which to do further research into this topic would be great - I lack experience with algorithms and consequently have struggled to find out where I should be looking for some more background on these sorts of problems.
This is very close to topological sorting, with your binary relation being edges. In particular, this is just extending a partial order into a total order. Naively if you consider all pairs using toposort (which is O(V+E)) you have a worst case O(n^2) algorithm (actually O(n+p) with n being the number of elements and p the number of comparable pairs).
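As a sketch of that idea in Python, the standard library's graphlib.TopologicalSorter (Python 3.9+) can do the ordering once every defined pair has been turned into an edge. The function name sort_partial is made up, and f is assumed to behave as in the question (0 if x < y, 1 if y <= x, -1 if undefined).

```python
from graphlib import TopologicalSorter  # available from Python 3.9
from itertools import combinations

def sort_partial(items, f):
    """Order items so that every defined comparison is respected."""
    sorter = TopologicalSorter()
    for item in items:
        sorter.add(item)               # register items that have no defined pairs
    for x, y in combinations(items, 2):
        result = f(x, y)
        if result == 0:                # x < y: x must come before y
            sorter.add(y, x)
        elif result == 1:              # y <= x: y must come before x
            sorter.add(x, y)
        # result == -1: undefined pair, so no constraint is added
    return list(sorter.static_order())
```

static_order() raises graphlib.CycleError if the defined comparisons contradict each other. When the undefined pairs leave more than one valid order, this returns just one of them and does not report the ambiguity.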

Find mode of a multiset in given time bound (most multiplicity)

The given problem:
A multiset is a set in which some of the elements occur more than once (e.g. {a, f, b, b, e, c, b, g, a, i, b} is a multiset). The elements are drawn from a totally ordered set. Present an algorithm that, when presented with a multiset as input, finds an element that has the most occurrences in the multiset (e.g. in {a, f, b, b, e, c, b, g, a, c, b}, b has the most occurrences). The algorithm should run in O(n lg n/M + n) time, where n is the number of elements in the multiset and M is the highest number of occurrences of an element in the multiset. Note that you do not know the value of M.
[Hint: Use a divide-and-conquer strategy based on the median of the list. The subproblems generated by the divide-and-conquer strategy cannot be smaller than a 'certain' size in order to achieve the given time bound.]
Our initial solution:
Our idea was to use Moore's majority algorithm to determine if the multiset contained a majority candidate (eg. {a, b, b} has a majority, b). After determining if this was true or false we either output the result or find the median of the list using a given algorithm (known as Select) and split the list into three sublists (elements less than and equal to the median, and elements greater than the median). Again, we would check each of the lists to determine if the majority element was present, if so, that is your result.
For example, given the multiset {a, b, c, d, d, e, f}
Step 1: check for majority. None found, split the list based on the median.
Step 2: L1 = {a, b, c, d, d}, L2 = {e, f} Find the majority of each. None found, split the lists again.
Step 3: L11 = {a, b, c} L12 = {d, d} L21 = {e} L22 = {f} Check each for majority elements. L12 returns d. In this case, d is the most frequently occurring element in the original multiset, and thus is the answer.
The issues we're having are whether this type of algorithm is fast enough, as well as whether this can be done recursively or if a loop that terminates is required. Like the hint says, the sub-problems cannot be smaller than a 'certain' size, which we believe to be M (the most occurrences).
If you use recursion in the most straightforward way, as described in your post, it will not have the desired time complexity. Why? Let's assume that the answer element is the largest one. Then it is always located in the right branch of the recursion. But we call the left branch first, which can go much deeper if all elements there are distinct (getting pieces of size 1, while we do not want them to get smaller than M).
Here is a correct solution:
Let's always split the array into three parts at each step as described in your question. Now let's step aside and take a look at what we have: recursive calls form a tree. To get the desired time complexity, we should never go deeper than the level where the answer is located. To achieve this, we can traverse the tree using a breadth-first search with queue instead of a depth-first search. That's it.
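A rough Python sketch of the breadth-first traversal is shown below. It is only meant to show the order in which the pieces are visited: statistics.median_low sorts its input, so this version does not meet the required time bound, and a linear-time Select would have to be substituted for that. The function name mode_by_bfs is made up for the example.

```python
from collections import deque
from statistics import median_low

def mode_by_bfs(multiset):
    """Return (element, count) for a most frequent element."""
    best_item, best_count = None, 0
    queue = deque([list(multiset)])
    while queue:
        part = queue.popleft()
        if len(part) <= best_count:
            continue                   # too small to beat the best count so far
        pivot = median_low(part)       # stand-in for a linear-time Select
        less = [x for x in part if x < pivot]
        greater = [x for x in part if x > pivot]
        equal_count = len(part) - len(less) - len(greater)
        if equal_count > best_count:
            best_item, best_count = pivot, equal_count
        queue.append(less)             # the queue gives level-by-level order
        queue.append(greater)
    return best_item, best_count
```

Because pieces are taken from the queue level by level, a piece is only split further while it is still larger than the best count found so far, which is what keeps the traversal from descending into long runs of distinct elements.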
If you want to do this in real life, it is worth considering using a hash table to track the counts. This can have amortized O(1) complexity per hash table access, so the overall complexity of the following Python code is O(n).
import collections
C = collections.Counter(['a','f','b','b','e','c','b','g','a','i','b'])
most_common_element, highest_count = C.most_common(1)[0]

Hierarchical undirected graph representation

I need to represent the graph like this:
Graph = graph([Object1,Object2,Object3,Object4],
              [arc(Object1,Object2,connected),
               arc(Object2,Object4,connected),
               arc(Object3,Object4,connected),
               arc(Object1,Object3,connected),
               arc(Object2,Object3,parallel),
               arc(Object1,Object4,parallel),
               arc(Object2,Object3,similar_size),
               arc(Object1,Object4,similar_size)])
I have no restriction for code, however I'd stick to this representation as it fits all the other structures I've already coded.
What I mean is the undirected graph in which vertices are some objects and edges representing undirected relations between them. To give you more background in this particular example I'm trying to represent a rectangle, so objects are its four edges(segments). Those segments are represented in the same way with use of vertices and so on. The point is to build the hierarchy of graphs which would represent constraints between objects on the same level.
The problem lies in the representation of edges. The most obvious way to represent an arc (a,b) would be to put both (a,b) and (b,a) in the program. This, however, floods my program with exponentially growing redundant data. For example, if I have vertices a,b,c,d, I can build segments (a,b),(a,c),(a,d),(b,c),(b,d),(c,d). But I also get (b,a),(c,a), and so on. At this point it's not a problem. But later I build a rectangle. It can be built from segments (a,b),(b,c),(c,d),(a,d). And I'd like to get the answer: there's one rectangle. You can calculate, however, how many combinations of this one rectangle I get. It also takes too much time to calculate, and obviously I don't want to finish at the rectangle level.
I thought about sorting the elements. I can sort vertices in a segment. But if I want to sort segments in a rectangle the constraints are no longer valid. The graph becomes directed. For example taking into consideration the first two relations let's say we have arcs (a,b) and (a,c). If arcs are not sorted the program answers as I want it to: arc(b,a,connected),arc(a,c,connected) with match: Object1=b,Object2=a,Object4=c. If I sort elements it's no longer valid as I cannot have arc(b,a,connected) and arc(a,b,connected) tried out. Only the second one. I'd stick with the sorting but I have no idea how to solve this last issue.
Hopefully I stated all of this quite clearly. I'd prefer to stay as close as possible to the representation and ideas I already have, but completely new ones are also very welcome. I don't expect an exact answer, rather someone pointing me in the right direction or suggesting something specific to read, as I'm quite new to Prolog and maybe this problem is not as uncommon as I think.
I've been trying to solve this since yesterday and couldn't come up with any easy answer. I looked at some discrete math and common undirected graph representations like the adjacency list. Let me know if anything is unclear - I'll try to provide more details.
Interesting question although a bit broad since it is not stated what you actually want to do with the arcs, rectangles etc; a representation may be efficient (time/space/elegance) only with certain uses. In any case, here are some ideas:
Sorting
the obvious issue is the one you mentioned; you can solve it by introducing a clause that succeeds if the sorted pair exists:
arc(X,Y):-
    arc_data(X,Y)
    ;   arc_data(Y,X).
Note that you should not do something like:
arc(a,b).
arc(b,c).
arc(X,Y):-
    arc(Y,X).
since this will result in an infinite loop if the arc does not exist.
You could, however, only check if the first arg is larger than the second:
arc(a,b).
arc(b,c).
arc(X,Y):-
    compare(>,X,Y),
    arc(Y,X).
This approach will not resolve the multiple solutions that may arise due to having an arc represented in two ways.
The easy fix would be to only check for one solution where only one solution is expected using once/1:
3 ?- arc(X,Y).
X = a,
Y = b ;
X = b,
Y = a.
4 ?- once(arc(X,Y)).
X = a,
Y = b.
Of course you cannot do this when there could be multiple solutions.
Another approach would be to enforce further abstraction: at the moment, when you have two points (a, b) you can create the arc (arc(a,b) or arc(b,a)) after checking if those points are connected. Instead of that, you should create the arc through a predicate (that could also check if the points are connected). The benefit is that you no longer get involved in the representation of the arc directly and can thus enforce sorting (yes, it's basically object orientation):
cv_arc(X,Y,Arc):-
    (   arc(X,Y),
        Arc = arc(X,Y))
    ;   (   arc(Y,X),
            Arc = arc(Y,X)).
(assuming as a database arc(a,b)):
6 ?- cv_arc(a,b,A).
A = arc(a, b).
7 ?- cv_arc(b,a,A).
A = arc(a, b).
8 ?- cv_arc(b,c,A).
false.
Of course you would need to follow a similar principle for the rest of the objects; I assume that you are doing something like this to find a rectangle:
rectangle(A,B,C,D):-
    arc(A,B),
    arc(B,C),
    arc(C,D),
    arc(D,A).
besides the duplicates due to the arc (which are resolved) this would recognise ABCD, DABC etc as different rectangles:
28 ?- rectangle(A,B,C,D).
A = a,
B = b,
C = c,
D = d ;
A = b,
B = c,
C = d,
D = a ;
A = c,
B = d,
C = a,
D = b ;
A = d,
B = a,
C = b,
D = c.
We will do the same again:
rectangle(rectangle(A,B,C,D)):-
    cv_arc(A,B,AB),
    cv_arc(B,C,BC),
    compare(<,AB,BC),
    cv_arc(C,D,CD),
    compare(<,BC,CD),
    cv_arc(D,A,DA),
    compare(<,CD,DA).
and running with arc(a,b). arc(b,c). arc(c,d). arc(a,d).:
27 ?- rectangle(R).
R = rectangle(a, b, c, d) ;
false.
Note that we did not re-order the rectangle if the arcs were in the wrong order; we simply failed it. This way we avoided duplicate solutions (if we ordered them and accepted it as a valid rectangle we would have the same rectangle four times), but the time spent finding the rectangle increases. We reduced the overhead by stopping the search at the first arc that is out of order instead of creating the whole rectangle. The overhead would also be reduced if the arcs are ordered (since the first match would be ordered). On the other hand, if we consider the complexity of searching for all rectangles this way, the overhead is not that significant. Also, it only applies if we want just the first rectangle; should we want to get more solutions or ensure that there are no other solutions, Prolog will search the whole tree, whether it reports the solutions or not.
