Establishing chronology of a list based on a directed graph tree - algorithm

So, in a personal project I've been working on, I came across the following problem, and I've been struggling to come up with a solution since my maths skills are not terribly great.
Let's say you have the following tree of numbers a b c d e f g h:
    a
   / \
  b   c
 / \  |
g   d f
|   |
h   e
Each step down the tree means that the next number is bigger than the previous one. So a < b, d < e, a < c. However, it is impossible to determine whether b > c or b < c; we can only tell that both numbers are bigger than a.
Let's say we have an ordered list of numbers, for instance [a, b, c, d, e]. How do we write an algorithm that checks whether the order of the numbers in the list (assuming that L[i] < L[i+1]) is, in fact, correct in relation to the information we have according to this tree?
E.g., both [a, c, b, d, e] and [a, b, d, c, e] are correct, but [c, a, b, d, e] is not (since we know that c > a, but nothing else about how the other numbers are structured).
For the sake of the algorithm, let's assume that our access to the tree is a function provably_greater(X, Y) which returns true if the tree knows that a number is higher than another number, e.g. provably_greater(d, a) = True, but provably_greater(d, f) = False. Naturally, if a number is provably not greater, it also returns false.
This is not a homework question; I have abstracted the problem quite a lot to make it clearer, but solving it is crucial for what I'm trying to do. I've made several attempts at cracking it myself, but everything I come up with ends up being insufficient for some edge case I find out about later.
Thanks in advance.

Your statement "everything that I come up with ends up being insufficient for some edge case I find out about later" makes it seem that you have no solution at all. Here is a brute-force algorithm that should work in all cases. I can think of several possible ways to improve the speed, but this is a start.
First, set up a data structure that allows a quick evaluation of provably_greater(X, Y) based on the tree. This structure can be a set or hash-table, which will take much memory but allow fast access. For each leaf of the tree, take a path up to the root. At each node, look at all descendant nodes and add an ordered pair to the set that shows the less-than relation for those two nodes. In your example tree, if you start at node h you move up to node g and add (g,h) to the set, then move up to node b and add the pairs (b,h) and (b,g) to the set, then move up to node a and add the pairs (a,h), (a,g), and (a,b) to the set. Do the same for leaf nodes e and f. The pair (a,b) will be added twice to the set, due to the leaf nodes h and e, but a set structure can handle this easily.
The function provably_greater(X, Y) is now quick and easy: the result is True if the pair (Y,X) is in the set and False otherwise.
You now look at all pairs of numbers in your list--for the list [a,b,c,d,e] you would look at the pairs (a,b), (a,c), (b,c), etc. If provably_greater(X, Y) is true for any of those pairs, the list is out of order. Otherwise, the list is in order.
This should be very easy to implement in a language like Python. Let me know if you would like some Python 3 code.
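For illustration, here is a minimal Python 3 sketch of this approach. The dict-of-children encoding of the tree is an assumption made for the example; any tree representation that lets you walk down from the root works.

def build_pair_set(tree, root):
    # collect every (ancestor, descendant) pair as a "less-than" relation
    pairs = set()
    def walk(node, ancestors):
        for anc in ancestors:
            pairs.add((anc, node))
        for child in tree.get(node, []):
            walk(child, ancestors + [node])
    walk(root, [])
    return pairs

def provably_greater(pairs, x, y):
    # x is provably greater than y iff the pair (y, x) is in the set
    return (y, x) in pairs

def list_in_order(pairs, order):
    # the list claims order[i] < order[j] for i < j, so it is wrong whenever
    # an earlier element is provably greater than a later one
    return not any(provably_greater(pairs, order[i], order[j])
                   for i in range(len(order))
                   for j in range(i + 1, len(order)))

tree = {'a': ['b', 'c'], 'b': ['g', 'd'], 'c': ['f'], 'g': ['h'], 'd': ['e']}
pairs = build_pair_set(tree, 'a')
print(list_in_order(pairs, ['a', 'b', 'd', 'c', 'e']))  # True
print(list_in_order(pairs, ['c', 'a', 'b', 'd', 'e']))  # False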

I'm going to ignore your provably_greater function and assume access to the tree so that I can provide an efficient algorithm.
First, perform an Euler tour of the tree, remembering the start and end indexes for every node. If you use the same tree to check a lot of lists, you only have to do this once. See https://www.geeksforgeeks.org/euler-tour-tree/
Create an initially empty binary search tree of indexes
Iterate through the list. For each node, check to see if the tree contains any indexes between its start and end Euler tour indexes. If it does, then the list is out of order. If it does not, then insert its start index into the tree. This will prevent any provably lesser node from appearing later in the list.
That's it -- O(N log N) altogether, and for each list.
A TreeSet in Java or std::set in C++ can be used for the binary search tree.
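For illustration, here is a rough Python sketch of this approach. The dict-of-children encoding of the tree is an assumption for the example, and a plain sorted list with bisect stands in for the balanced BST (its insertion is O(N), so substitute a real tree set to get the stated O(N log N) bound).

import bisect

def euler_index(tree, root):
    # assign (start, end) DFS index intervals to every node; a node's
    # descendants are exactly the nodes whose indexes fall inside its interval
    start, end = {}, {}
    counter = 0
    stack = [(root, False)]
    while stack:
        node, done = stack.pop()
        if done:
            end[node] = counter
        else:
            start[node] = counter
            stack.append((node, True))
            for child in reversed(tree.get(node, [])):
                stack.append((child, False))
        counter += 1
    return start, end

def list_is_consistent(tree, root, order):
    start, end = euler_index(tree, root)
    seen = []  # sorted list of start indexes already emitted
    for node in order:
        lo, hi = start[node], end[node]
        # any previously seen index strictly inside (lo, hi) belongs to a
        # descendant of `node`, i.e. a provably greater element came earlier
        i = bisect.bisect_right(seen, lo)
        if i < len(seen) and seen[i] < hi:
            return False
        bisect.insort(seen, lo)  # O(N) insert here; a balanced BST gives O(log N)
    return True

tree = {'a': ['b', 'c'], 'b': ['g', 'd'], 'c': ['f'], 'g': ['h'], 'd': ['e']}
print(list_is_consistent(tree, 'a', ['a', 'b', 'd', 'c', 'e']))  # True
print(list_is_consistent(tree, 'a', ['c', 'a', 'b', 'd', 'e']))  # False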

Related

Find mode of a multiset in given time bound (most multiplicity)

The given problem:
A multiset is a set in which some of the elements occur more than once (e.g. {a, f, b, b, e, c, b, g, a, i, b} is a multiset). The elements are drawn from a totally ordered set. Present an algorithm that, when presented with a multiset as input, finds an element that has the most occurrences in the multiset (e.g. in {a, f, b, b, e, c, b, g, a, c, b}, b has the most occurrences). The algorithm should run in O(n lg(n/M) + n) time, where n is the number of elements in the multiset and M is the highest number of occurrences of an element in the multiset. Note that you do not know the value of M.
[Hint: Use a divide-and-conquer strategy based on the median of the list. The subproblems generated by the divide-and-conquer strategy cannot be smaller than a 'certain' size in order to achieve the given time bound.]
Our initial solution:
Our idea was to use the Boyer-Moore majority vote algorithm to determine whether the multiset contains a majority element (e.g. {a, b, b} has a majority, b). After determining whether this is true or false, we either output the result or find the median of the list using a given algorithm (known as Select) and split the list into three sublists: elements less than the median, elements equal to the median, and elements greater than the median. Again, we check each of the lists to determine if a majority element is present; if so, that is the result.
For example, given the multiset {a, b, c, d, d, e, f}
Step 1: Check for a majority. None found, so split the list based on the median.
Step 2: L1 = {a, b, c, d, d}, L2 = {e, f}. Find the majority of each. None found, so split the lists again.
Step 3: L11 = {a, b, c}, L12 = {d, d}, L21 = {e}, L22 = {f}. Check each for a majority element. L12 returns d. In this case, d is the most frequently occurring element in the original multiset and is thus the answer.
The issues we're having are whether this type of algorithm is fast enough, as well as whether this can be done recursively or if a loop that terminates is required. Like the hint says, the sub-problems cannot be smaller than a 'certain' size, which we believe to be M (the most occurrences).
If you use recursion in the most straightforward way described in your post, it will not have the desired time complexity. Why? Let's assume that the answer element is the largest one. Then it is always located in the right branch of the recursion. But we call the left branch first, and that branch can go much deeper if all elements there are distinct (producing pieces of size 1, while we do not want pieces smaller than M).
Here is a correct solution:
Let's always split the array into three parts at each step, as described in your question. Now let's step aside and look at what we have: the recursive calls form a tree. To get the desired time complexity, we should never go deeper than the level where the answer is located. To achieve this, traverse that tree of subproblems breadth-first, using a queue, instead of depth-first. That's it.
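For illustration, a minimal Python sketch of this breadth-first scheme is below. It adds one extra (safe) pruning step, skipping pieces that are already too small to beat the best majority found so far, and uses statistics.median_low as a stand-in for the linear-time Select the exercise assumes.

from collections import deque
import statistics

def majority(piece):
    # Boyer-Moore majority vote: return (element, count) if some element
    # occurs in more than half of `piece`, else None
    candidate, votes = None, 0
    for x in piece:
        if votes == 0:
            candidate, votes = x, 1
        elif x == candidate:
            votes += 1
        else:
            votes -= 1
    count = sum(1 for x in piece if x == candidate)
    return (candidate, count) if count * 2 > len(piece) else None

def mode(multiset):
    best_elem, best_count = None, 0
    queue = deque([list(multiset)])
    while queue:                      # breadth-first over the subproblems
        piece = queue.popleft()
        if len(piece) <= best_count:  # cannot contain a better mode
            continue
        maj = majority(piece)
        if maj is not None:           # a majority piece never needs splitting
            if maj[1] > best_count:
                best_elem, best_count = maj
            continue
        # no majority: split into three parts around the median; equal
        # elements always stay together, so the mode is never broken up
        # (median_low sorts, O(n log n); the stated bound needs median-of-medians)
        m = statistics.median_low(piece)
        queue.append([x for x in piece if x < m])
        queue.append([x for x in piece if x > m])
        eq = [x for x in piece if x == m]   # uniform piece, handled directly
        if len(eq) > best_count:
            best_elem, best_count = m, len(eq)
    return best_elem, best_count

print(mode(['a','f','b','b','e','c','b','g','a','i','b']))  # ('b', 4)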
If you want to do this in real life, it is worth considering using a hash table to track the counts. This can have amortized O(1) complexity per hash table access, so the overall complexity of the following Python code is O(n).
import collections
C = collections.Counter(['a','f','b','b','e','c','b','g','a','i','b'])
most_common_element, highest_count = C.most_common(1)[0]

Checking if A is a part of binary tree B

Let's say I have binary trees A and B and I want to know if A is a "part" of B. I am not only talking about subtrees. What I want to know is if B has all the nodes and edges that A does.
My thought was that since a tree is essentially a graph, I could view this question as a subgraph isomorphism problem (i.e. checking whether A is a subgraph of B). But according to Wikipedia this is an NP-complete problem.
http://en.wikipedia.org/wiki/Subgraph_isomorphism_problem
I know that you can check if A is a subtree of B or not with O(n) algorithms (e.g. using preorder and inorder traversals to flatten the trees to strings and checking for substrings). I was trying to modify this a little to see if I can also test for just "parts" as well, but to no avail. This is where I'm stuck.
Are there any other ways to view this problem other than using subgraph isomorphism? I'm thinking there must be faster methods since binary trees are much more restricted and simpler versions of graphs.
Thanks in advance!
EDIT: I realized that the worst case for even a brute-force method for my question would only take O(m * n), which is polynomial. So I guess this isn't an NP-complete problem after all. Then my next question is: is there an algorithm that is faster than O(m*n)?
I would approach this problem in two steps:
Find the root of A in B (either BFS or DFS)
Verify that A is contained in B (given that starting node), using a recursive algorithm, as below. (I concocted some crazy pseudo-language because you didn't specify the language; I think this should be understandable, no matter your background.) Note that a is a node from A (initially the root) and b is a node from B (initially the node found in step 1)
function checkTrees(node a, node b) returns boolean
    if a does not exist or b does not exist then
        // base of the recursion
        return false
    else if a is different from b then
        // compare the current nodes
        return false
    else
        // check the children of a
        boolean leftFound = true
        boolean rightFound = true
        if a.left exists then
            // try to match the left child of a with
            // every possible neighbor of b
            leftFound = checkTrees(a.left, b.left)
                     or checkTrees(a.left, b.right)
                     or checkTrees(a.left, b.parent)
        if a.right exists then
            // try to match the right child of a with
            // every possible neighbor of b
            rightFound = checkTrees(a.right, b.left)
                      or checkTrees(a.right, b.right)
                      or checkTrees(a.right, b.parent)
        return leftFound and rightFound
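For reference, here is a direct Python translation of the pseudocode, assuming a hypothetical Node class with value, left, right and parent fields; node equality is taken to mean equal values.

class Node:
    def __init__(self, value, left=None, right=None):
        self.value, self.left, self.right, self.parent = value, left, right, None
        for child in (left, right):
            if child is not None:
                child.parent = self

def check_trees(a, b):
    # does the part of A rooted at `a` fit into B starting at node `b`?
    if a is None or b is None:
        return False
    if a.value != b.value:
        return False
    left_found = right_found = True
    if a.left is not None:
        # try to match a's left child with every neighbour of b
        left_found = any(check_trees(a.left, nb) for nb in (b.left, b.right, b.parent))
    if a.right is not None:
        # try to match a's right child with every neighbour of b
        right_found = any(check_trees(a.right, nb) for nb in (b.left, b.right, b.parent))
    return left_found and right_found

# tiny example: B is rooted at 1 with children 2 (which has child 4) and 3;
# A asks for a node 1 that has a child 3 somewhere next to it
B = Node(1, Node(2, Node(4)), Node(3))
A = Node(1, Node(3))
print(check_trees(A, B))  # True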
About the running time: let m be the number of nodes in A and n be the number of nodes in B. The search in the first step takes O(n) time. The running time of the second step depends on one crucial assumption I made, but that might be wrong: I assumed that every node of A is equal to at most one node of B. If that is the case, the running time of the second step is O(m) (because you can never search too far in the wrong direction). So the total running time would be O(m + n).
While writing down my assumption, I start to wonder whether that's not oversimplifying your case...
You could compare the trees bottom-up as follows:
For each leaf in tree A, identify the corresponding node in tree B.
Start a parallel traversal towards the root in both trees from the nodes just matched.
Specifically, move to the parent of a node in A and subsequently move towards the root in B until you either encounter the corresponding node in B (proceed), a marked node in A (see below; if a match in B is found, proceed, else fail), or the root of B (fail).
Mark all nodes visited in A.
You succeed if you haven't failed ;-).
The main part of the algorithm runs in O(e_B): in the worst case, all edges in B are visited a constant number of times. The leaf node matching runs in O(n_A * log n_B) if the B vertices are sorted, and otherwise in O(n_A * log n_A + n_B * log n_B + n) = O(n_B * log n_B) (sort each node set, then scan the results linearly).
EDIT:
Re-reading your question, the above-mentioned step 2 is even easier: for matching nodes in A and B, their parents must match too (otherwise there would be a mismatch between the edge sets). No effect on the worst-case run time, of course.
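Under the simplification in the EDIT, and assuming node values are unique so that "corresponding node" is well defined, a Python sketch of the check could look like this; the nested-dict tree encoding is an assumption for illustration.

def parent_map(tree):
    # map child value -> parent value for a tree given as nested dicts
    # {'value': v, 'left': ..., 'right': ...}
    parents = {}
    stack = [tree]
    while stack:
        node = stack.pop()
        if node is None:
            continue
        for child in (node.get('left'), node.get('right')):
            if child is not None:
                parents[child['value']] = node['value']
                stack.append(child)
    return parents

def a_part_of_b(a, b):
    pa, pb = parent_map(a), parent_map(b)
    a_values = set(pa) | ({a['value']} if a else set())
    b_values = set(pb) | ({b['value']} if b else set())
    # every node of A must appear in B, and every parent/child edge of A
    # must be the same parent/child edge in B
    return a_values <= b_values and all(pb.get(c) == p for c, p in pa.items())

B = {'value': 1, 'left': {'value': 2, 'left': {'value': 4}}, 'right': {'value': 3}}
A = {'value': 2, 'left': {'value': 4}}
print(a_part_of_b(A, B))                                   # True
print(a_part_of_b({'value': 4, 'left': {'value': 2}}, B))  # False: edge reversed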

What is the difference between sequential access and sequential traversal of elements in data structures

Sequential traversal is the main difference between linear and non-linear data structures. Can anyone explain it briefly?
A linear data structure is something like this:
A -> B -> C -> D -> E
For instance, lists and arrays. Each element is followed by a single element, so traversal is trivial: you simply go from one element to the next. For instance, if you start at A, you only have one next element, B; from B you only have one next element, C; and so on.
A non-linear data structure is something like this:
       A
     /   \
    B     C
   / \   / \
  D   E F   G
For instance, a tree. Notice how A is followed by two elements, B and C, and each of them is followed by two elements. Now traversal is more complex, because once you start from A, you have a choice of going to either B or C. What's more, once at B, you have a choice of going further down, or going "sideways" to C. In this case (a tree), your traversal options are breadth-first or depth-first.
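A small Python illustration of the difference, using a hypothetical dict-of-children encoding for the tree:

from collections import deque

linear = ['A', 'B', 'C', 'D', 'E']
print(linear)                     # one successor per element: A B C D E

tree = {'A': ['B', 'C'], 'B': ['D', 'E'], 'C': ['F', 'G']}

def bfs(root):                    # breadth-first: level by level
    order, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        queue.extend(tree.get(node, []))
    return order

def dfs(root):                    # depth-first: go down before going sideways
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(reversed(tree.get(node, [])))
    return order

print(bfs('A'))  # ['A', 'B', 'C', 'D', 'E', 'F', 'G']
print(dfs('A'))  # ['A', 'B', 'D', 'E', 'C', 'F', 'G']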

Hierarchical undirected graph representation

I need to represent the graph like this:
Graph = graph([Object1,Object2,Object3,Object4],
[arc(Object1,Object2,connected),
arc(Object2,Object4,connected),
arc(Object3,Object4,connected),
arc(Object1,Object3,connected),
arc(Object2,Object3,parallel),
arc(Object1,Object4,parallel),
arc(Object2,Object3,similar_size),
arc(Object1,Object4,similar_size)])
I have no restrictions on the code; however, I'd like to stick to this representation as it fits all the other structures I've already coded.
What I mean is an undirected graph in which the vertices are some objects and the edges represent undirected relations between them. To give you more background: in this particular example I'm trying to represent a rectangle, so the objects are its four edges (segments). Those segments are represented in the same way, with vertices, and so on. The point is to build a hierarchy of graphs which would represent constraints between objects on the same level.
The problem lies in the representation of edges. The most obvious way to represent an arc (a,b) would be to put both (a,b) and (b,a) in the program. This, however, floods my program with redundant data exponentially. For example, if I have vertices a, b, c, d, I can build segments (a,b), (a,c), (a,d), (b,c), (b,d), (c,d). But I also get (b,a), (c,a), and so on. At this point it's not a problem. But later I build a rectangle. It can be built from segments (a,b), (b,c), (c,d), (a,d). And I'd like to get the answer: there is one rectangle. You can calculate, however, how many combinations of this one rectangle I get. It also takes too much time to calculate, and obviously I don't want to finish at the rectangle level.
I thought about sorting the elements. I can sort the vertices within a segment. But if I want to sort the segments within a rectangle, the constraints are no longer valid; the graph becomes directed. For example, taking into consideration the first two relations, let's say we have arcs (a,b) and (a,c). If the arcs are not sorted, the program answers as I want it to: arc(b,a,connected), arc(a,c,connected) with the match Object1=b, Object2=a, Object4=c. If I sort the elements it's no longer valid, as arc(b,a,connected) and arc(a,b,connected) cannot both be tried, only the second one. I'd stick with the sorting, but I have no idea how to solve this last issue.
Hopefully I stated all of this quite clearly. I'd prefer to stay as close to the representation and ideas I already have, but completely new ones are also very welcome. I don't expect an exact answer; rather, point me in the right direction or suggest something specific to read, as I'm quite new to Prolog and maybe this problem is not as uncommon as I think.
I've been trying to solve this since yesterday and couldn't come up with any easy answer. I looked at some discrete maths and common undirected graph representations like adjacency lists. Let me know if anything is unclear and I'll try to provide more details.
Interesting question, although a bit broad, since it is not stated what you actually want to do with the arcs, rectangles, etc.; a representation may be efficient (in time/space/elegance) only for certain uses. In any case, here are some ideas:
Sorting
the obvious issue is the one you mentioned; you can solve it by introducing a clause that succeeds if the sorted pair exists:
arc(X,Y):-
    arc_data(X,Y)
    ; arc_data(Y,X).
note that you should not do something like:
arc(a,b).
arc(b,c).
arc(X,Y):-
    arc(Y,X).
since this will result in an infinite loop if the arc does not exist.
you could however only check if the first arg is larger than the second:
arc(a,b).
arc(b,c).
arc(X,Y):-
    compare(>,X,Y),
    arc(Y,X).
This approach will not resolve the multiple solutions that may arise due to having an arc represented in two ways.
The easy fix would be to only check for one solution where only one solution is expected using once/1:
3 ?- arc(X,Y).
X = a,
Y = b ;
X = b,
Y = a.
4 ?- once(arc(X,Y)).
X = a,
Y = b.
Of course you cannot do this when there could be multiple solutions.
Another approach would be to enforce further abstraction: at the moment, when you have two points (a, b) you can create the arc (arc(a,b) or arc(b,a)) after checking if those points are connected. Instead of that, you should create the arc through a predicate (that could also check if the points are connected). The benefit is that you no longer get involved in the representation of the arc directly and can thus enforce sorting (yes, it's basically object orientation):
cv_arc(X,Y,Arc):-
    (   arc(X,Y),
        Arc = arc(X,Y))
    ;   (   arc(Y,X),
            Arc = arc(Y,X)).
(assuming as a database arc(a,b)):
6 ?- cv_arc(a,b,A).
A = arc(a, b).
7 ?- cv_arc(b,a,A).
A = arc(a, b).
8 ?- cv_arc(b,c,A).
false.
Of course you would need to follow a similar principle for the rest of the objects; I assume that you are doing something like this to find a rectangle:
rectangle(A,B,C,D):-
    arc(A,B),
    arc(B,C),
    arc(C,D),
    arc(D,A).
besides the duplicates due to the arc (which are resolved) this would recognise ABCD, DABC etc as different rectangles:
28 ?- rectangle(A,B,C,D).
A = a,
B = b,
C = c,
D = d ;
A = b,
B = c,
C = d,
D = a ;
A = c,
B = d,
C = a,
D = b ;
A = d,
B = a,
C = b,
D = c.
We will do the same again:
rectangle(rectangle(A,B,C,D)):-
    cv_arc(A,B,AB),
    cv_arc(B,C,BC),
    compare(<,AB,BC),
    cv_arc(C,D,CD),
    compare(<,BC,CD),
    cv_arc(D,A,DA),
    compare(<,CD,DA).
and running with arc(a,b). arc(b,c). arc(c,d). arc(a,d).:
27 ?- rectangle(R).
R = rectangle(a, b, c, d) ;
false.
Note that we did not re-order the rectangle if the arcs were in the wrong order; we simply failed it. This way we avoid duplicate solutions (if we ordered them and accepted the result as a valid rectangle, we would get the same rectangle four times), but the time spent finding the rectangle increases. We reduce the overhead by stopping the search at the first arc that is out of order instead of creating the whole rectangle. The overhead is also reduced if the arcs happen to be ordered already (since the first match would then be ordered). On the other hand, if we consider the complexity of searching for all rectangles this way, the overhead is not that significant. Also, it only matters if we want just the first rectangle; should we want more solutions, or want to ensure that there are no other solutions, Prolog will search the whole tree, whether it reports the solutions or not.

Comparison Based Ranking Algorithm (Variation)

This question is a variation on a previous question:
Comparison-based ranking algorithm
The variation I would like to pose is: what if loops are solved by discarding the earliest contradicting choices so that a transitive algorithm could actually be used.
Here I have pasted the original question:
"I would like to rank or sort a collection of items (with size potentially greater than 100,000) where each item in the collection does not have an intrinsic (comparable) value, instead all I have is the comparisons between any two items which have been provided by users in a 'subjective' manner.
Example:
Consider a collection with elements [a, b, c, d]. And comparisons by users:
b > a, a > d, d > c
The correct order of this collection would be [b, a, d, c].
This example is simple however there could be more complicated cases:
Since the comparisons are subjective, a user could also say that c > b, which would cause a conflict with the ordering above. Also, you may not have comparisons that 'connect' all the items, i.e.:
b > a, d > c. In this case the ordering is ambiguous: it could be [b, a, d, c] or [d, c, b, a], and either ordering is acceptable.
...
The Question is:
Is there an algorithm which already exists that can solve the problem above, I would not like to spend effort trying to come up with one if that is the case. If there is no specific algorithm, is there perhaps certain types of algorithms or techniques which you can point me to?"
The simpler version where no "cycle" exists can be dealt with using topological sorting.
Now, for the more complex scenario, if for every "cycle" the order on which the elements appear in the final ranking does not matter, then you could try the following:
model the problem as a directed graph (i.e. the fact that a > b implies that there is an edge in the resulting graph starting in node "a" and ending in node "b").
calculate the strongly connected components (SCC) of the graph. In short, an SCC is a set of nodes with the property that you can get to any node in the set from any node in the set by following a list of edges (this corresponds to your "cycles" in the original problem).
transform the graph by "collapsing" each node into the SCC it belongs to, but preserve the edges that go between different SCCs.
it turns out the new graph obtained in this way is a directed acyclic graph, so we can perform a topological sort on it.
Finally, we're ready. The topological sort should tell you the right order in which to print nodes in different SCCs. For the nodes within the same SCC, no matter what order you choose, there will always be "cycles", so a possibility might be printing them in a random order.
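For illustration, here is a sketch of this pipeline in Python, using Kosaraju's algorithm for the SCCs and graphlib.TopologicalSorter (Python 3.9+) for the topological sort; encoding each comparison "x > y" as a directed edge x -> y is an assumption made for the example.

from collections import defaultdict
from graphlib import TopologicalSorter

def rank(comparisons):
    # comparisons: iterable of (x, y) pairs meaning "x > y"
    # returns a list of sets; each set is one strongly connected component
    # (items tangled in cycles), ordered from "greatest" to "least"
    graph = defaultdict(set)
    nodes = set()
    for x, y in comparisons:
        graph[x].add(y)               # edge x -> y for "x > y"
        nodes.update((x, y))

    # Kosaraju: first pass records finish order, second pass peels off SCCs
    # on the reversed graph (recursive DFS kept short for the sketch)
    order, seen = [], set()
    def dfs1(u):
        seen.add(u)
        for v in graph[u]:
            if v not in seen:
                dfs1(v)
        order.append(u)
    for u in nodes:
        if u not in seen:
            dfs1(u)

    reverse = defaultdict(set)
    for u in list(graph):
        for v in graph[u]:
            reverse[v].add(u)

    comp = {}
    def dfs2(u, label):
        comp[u] = label
        for v in reverse[u]:
            if v not in comp:
                dfs2(v, label)
    n_comps = 0
    for u in reversed(order):
        if u not in comp:
            dfs2(u, n_comps)
            n_comps += 1

    # collapse each SCC into a single node, keeping edges between SCCs
    cond = {label: set() for label in range(n_comps)}
    for u in list(graph):
        for v in graph[u]:
            if comp[u] != comp[v]:
                cond[comp[u]].add(comp[v])

    # TopologicalSorter treats the stored sets as predecessors, so it yields
    # the "smallest" components first; reverse to rank greatest to least
    members = defaultdict(set)
    for u, label in comp.items():
        members[label].add(u)
    low_to_high = list(TopologicalSorter(cond).static_order())
    return [members[label] for label in reversed(low_to_high)]

print(rank([('b', 'a'), ('a', 'd'), ('d', 'c')]))
# [{'b'}, {'a'}, {'d'}, {'c'}]
print(rank([('b', 'a'), ('a', 'd'), ('d', 'c'), ('c', 'b')]))
# [{'a', 'b', 'c', 'd'}]  (the cycle collapses into a single component)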

Resources