Algorithm to find difference between two uniquely labeled general trees

I'm looking for an algorithm that finds the minimal changes (insert, delete, move) to get from tree1 to tree2.
The tree is a general and uniquely labeled tree.
This means that each node's value is some kind of unique identifier, like a UUID.
Each node can have children and the order of those child nodes is important.
Example tree 1
A
- B
- C
- D
- E
- F
Example tree 2
A
- B
- F
- C
- D
- G
expected changes from tree 1 to tree 2
move(atIndex: 1, inParent: A, toIndex: 0, inParent: F)
delete(atIndex: 1, inParent: C)
insert(atIndex: 2, inParent: A)
I need this for a macOS app to animate changes in an NSOutlineView.
I found some algorithms to find changes between general trees, but only for ordered and labeled trees and none for uniquely labeled trees.
The algorithms for non-uniquely labeled trees seem to be really complex, and I thought that if the labels are unique there must be a simpler and more efficient algorithm.
BTW: NSOutlineView does not have a root node but an array of nodes at the top level. I don't know if that's important.

You're right -- label uniqueness does simplify the problem.
For each label present in both trees, compute a sequence of operations to edit its children. This involves computing a longest common subsequence and issuing appropriate moves/inserts/deletes for all nodes not in this subsequence. (In case a label is present in both trees but with different parents, we should suppress the delete for the old parent.)
This set of operations constitutes a lower bound, but if we just do them, we may hit a situation where we try to move a node under one of its descendants. Perhaps it's enough to detect when this is about to happen and split it into two moves. If you really need the optimal sequence of moves, then I think there's a feedback arc/node set problem to be solved, though it might not be NP-hard here due to the tree structure.
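As a minimal sketch of the per-parent LCS step, assuming the trees are given as parent -> children-list maps (the function names lcs and diff_children and the in_old_tree/in_new_tree predicates are mine, not from any library):

def lcs(a, b):
    # classic longest-common-subsequence DP over suffixes, then recover one LCS
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            dp[i][j] = dp[i + 1][j + 1] + 1 if a[i] == b[j] else max(dp[i + 1][j], dp[i][j + 1])
    out, i, j = [], 0, 0
    while i < m and j < n:
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return out

def diff_children(parent, old, new, in_old_tree, in_new_tree):
    # Children that keep their relative order under this parent stay put;
    # everything else becomes a delete/insert, or half of a move if the
    # label still exists somewhere in the other tree (a reparented node).
    keep = set(lcs([c for c in old if c in new], [c for c in new if c in old]))
    ops = []
    for i, c in enumerate(old):
        if c not in keep:
            ops.append(("move-out" if in_new_tree(c) else "delete", c, parent, i))
    for j, c in enumerate(new):
        if c not in keep:
            ops.append(("move-in" if in_old_tree(c) else "insert", c, parent, j))
    return ops

Running diff_children for every parent present in both trees and pairing each move-out with its matching move-in yields the move/insert/delete list from the question; the move-under-a-descendant corner case described above still needs the extra split.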

Given that labels are unique, I'm not convinced that "minimal" changes are any different than outlining the diff for each node's full path.
If we look at the path from right to left, we can see that the first parent and index of D below didn't change.
It seems like the minimal changes given in the question assume that F moves from index 2 to 1 "magically" when C leaves index 1. In reality the two actions "move C" and "move F into C's former place", which we would get from just examining the paths, are what actually has to happen; the question's single "move C" just leaves the second one implicit.
Example tree 1
A ⊥
- B A 0
- C A 1
- D A C 0
- E A C 1
- F A 2
Example tree 2
A ⊥
- B A 0
- F *A 1
- C *A 1 F 0
- D *A 1 F 0 C 0
- G A 2
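To make the path comparison concrete, here is a small Python sketch (my own; the trees are hard-coded as parent -> children dicts) that computes each node's (parent, index) path and reports whose path changed:

def paths(tree, root, prefix=()):
    # map each label to its path of (parent, index) steps from the root
    out = {root: prefix}
    for i, child in enumerate(tree.get(root, [])):
        out.update(paths(tree, child, prefix + ((root, i),)))
    return out

tree1 = {"A": ["B", "C", "F"], "C": ["D", "E"]}
tree2 = {"A": ["B", "F", "G"], "F": ["C"], "C": ["D"]}

p1, p2 = paths(tree1, "A"), paths(tree2, "A")
changed = {n for n in p1.keys() & p2.keys() if p1[n] != p2[n]}
print(changed)   # {'C', 'D', 'F'} -- D only because its ancestors moved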

Related

How to determine if two binary trees are equal or different

The picture below is the case of different binary trees that can be made with 3 nodes.
But why is the following case not included in the number of cases?
Is it the same case as the third case from the left in the picture above? If so, I think the parent-child relationship will be different.
I'd appreciate it if you could tell me how to determine whether two binary trees are equal to or different from each other.
You mean 3 nodes. Your case is not included because the cases all use A as the root node, so it is easier to demonstrate the different possible combinations by always using the same elements in the same order, i.e. A as root, then B -> C on top and, symmetrically, C -> B at the bottom.
Using B or C as the root, you could achieve the same number of variations. This number can be computed using the following recurrence (the Catalan numbers):
F(n) = F(0)*F(n-1) + F(1)*F(n-2) + ... + F(n-1)*F(0), with F(0) = 1
In this case, n=3, so F(0)*F(2) + F(1)*F(1) + F(2)*F(0) = 2 + 1 + 2 = 5
F(0) = 1 (empty is considered one variation)
F(1) = 1
F(2) = 2
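A quick way to check the recurrence for other values of n, in plain Python (my own snippet, not part of the answer):

def num_binary_trees(n):
    # F(n) via the recurrence above; F(0) = 1 because the empty tree
    # counts as one variation
    f = [1] + [0] * n
    for k in range(1, n + 1):
        f[k] = sum(f[i] * f[k - 1 - i] for i in range(k))
    return f[n]

print(num_binary_trees(3))   # 5 = F(0)*F(2) + F(1)*F(1) + F(2)*F(0)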
So in your picture, only one row is actually relevant if that is supposed to be a binary search tree. Also note that there is a difference between a binary tree and a binary search tree. From the first article below:
As we know, the BST is an ordered data structure that allows no duplicate values. However, Binary Tree allows values to be repeated twice or more. Furthermore, Binary Tree is unordered.
References
https://www.baeldung.com/cs/calculate-number-different-bst
https://en.wikipedia.org/wiki/Binary_tree#Using_graph_theory_concepts
https://encyclopediaofmath.org/wiki/Binary_tree

Why inorder traversal of binary search tree guarantee to have non-decreasing order?

TL;DR My ultimate problem is to find, in a proper binary tree (i.e. one that itself has at least two nodes), the two nodes such that one is just greater than the input and the other just less than it. (cont. under the line)
To implement that, I assumed that, quite literally, if you draw the tree (decently), then horizontally a node on the right is always greater than any node to its left.
In other words, quoting from Wikipedia (Binary search tree):
The key in each node must be greater than all keys stored in the left sub-tree, and smaller than all keys in the right sub-tree.
And this seems to be guaranteed only locally. With a figure like this:
A
/ \
B C
/ \ / \
D E F G
/ \
H I
(Letters have no order, i.e. just for structure)
By locally I mean when we talk about node E, it's guaranteed that D (with F and G) is smaller than E, but what about C, F, G compared to E, is that also guaranteed?
This seems quite intuitive (that F, C, G are all greater than E), but I can't find any way to prove it, so is there any counterexample or theoretical proof? Any existing theorems or suggestions?
EDIT: I finally found that this is equivalent to asking why the in-order traversal of a binary search tree is in non-decreasing order.
This seems quite intuitive (that F, C, G are all greater than E), but I can't find any way to prove it, so is there any counterexample or theoretical proof? Any existing theorems or suggestions?
F > A — definition of BST ("key in each node must be … smaller than all keys in the right sub-tree")
E < A — definition of BST ("key in each node must be greater than all keys stored in the left sub-tree")
E < A < F — transitivity
And so on for C and G
Imagine you have the same tree without E:
A
/ \
B C
/ / \
D F G
/ \
H I
Now you insert E, which is greater than B. What if E were also greater than A? In that case it would be inserted into the right subtree of A instead. So as long as E sits in the right subtree of B, it must be less than A:
A
/ \
B C
/ \ / \
D E F G
/ \
H I
You need to differentiate between a "binary tree" and a "binary search tree" to be precise about it.
A binary search tree has the property you're looking for: all nodes in the left branch are smaller than all nodes in the right branch. If that weren't the case, you couldn't use the search method usually associated with it -- look at a node, and if you're looking for a smaller key, go left; if you're looking for a larger key, go right. A basic BST as well as balanced trees like AVL and red-black all observe this same property.
There are other binary tree data structures which aren't "Binary Search Trees" -- for example, a min-heap and max-heap are both binary trees, but the 'left is smaller than right' constraint is not met all the way through the tree. Heaps are used to find the smallest or largest element in a set, so you only normally reason about the node at the top of the tree.
As to a proof, I guess there is this: if you accept that the search algorithm works, then we can show that this property must hold. For instance, in this tree:
a
/ \
b n
/ \
c d
then let's say you wanted to prove that d is smaller than n, or than any of n's children. Imagine you were searching for d and you were at a. Obviously, you compare a to d and find that d is smaller -- that's how you know to traverse left, to b. So right there we've got to have faith that the entire left tree (b and under) must all be smaller than a. The same argument applies for the right-hand side and keys greater than a.
So left-children(a) < a < right-children(a)
In terrible pseudo-proof...
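To connect this back to the in-order formulation in the question's EDIT, here is a small Python sketch (my own example tree, not from any answer) showing that an in-order traversal of a BST visits keys in non-decreasing order:

class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def inorder(node):
    # left subtree, then the node itself, then the right subtree
    if node is None:
        return
    yield from inorder(node.left)
    yield node.key
    yield from inorder(node.right)

# every key in a left subtree is smaller than its ancestor,
# every key in a right subtree is larger
root = Node(4, Node(2, Node(1), Node(3)), Node(6, Node(5), Node(7)))
keys = list(inorder(root))
print(keys)                      # [1, 2, 3, 4, 5, 6, 7]
assert keys == sorted(keys)      # non-decreasing, as the argument above implies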

Algorithm to detect and remove least number of inconsistent facts (probably in PROLOG)?

How do you program the following algorithm?
Imagine a list of "facts" like this, where the letters represent variables bound to numeric values:
x = 1
y = 2
z = 3
a = 1
b = 2
c = 3
a = x
b = z
c = z
These "facts" clearly must not all be "true": it's not possible for b=z while b=2 and z=3. If either b=z or b=2 were removed, all facts would be consistent. If z=3 and either b=z or c=z were removed, then all facts would be consistent but there would be one fewer fact than above. This set contains many such consistent subsets. For instance, a=1, b=2, c=3 is a consistent subset, and so are many others.
Two consistent subsets are larger than any other consistent subset in this example:
x = 1
y = 2
z = 3
a = 1
b = 2
c = 3
a = x
c = z
and
x = 1
y = 2
z = 3
a = 1
c = 3
a = x
b = z
c = z
Using an appropriate programming language (I'm thinking PROLOG, but maybe I'm wrong) how would you process a large set containing consistent and inconsistent facts, and then output the largest possible sub-set of consistent facts (or multiple subsets as in the example above)?
This is very closely related to the NP-hard multiway cut problem. In the (unweighted) multiway cut problem, we have an undirected graph and a set of terminal vertices. The goal is to remove as few edges as possible so that each terminal vertex lies in its own connected component.
For this problem, we can interpret each variable and each constant as a vertex, and each equality as an edge from its left-hand side to its right-hand side. The terminal vertices are those associated with constants.
For only two terminals, the multiway cut problem is the polynomial-time solvable s-t minimum cut problem. We can use minimum cuts to get a polynomial-time 2-approximation to the multiway cut problem, by finding the cheapest cut separating two terminals, deleting the edges involved, and then recursing on the remaining connected components. Several approximation algorithms with better ratios have been proposed in the theoretical literature on multiway cut.
Practically speaking, multiway cut arises in applications to computer vision, so I would expect that there has been some work on obtaining exact solutions. I don't know what's out there, though.
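To make the reduction concrete, here is a small sketch in Python using the networkx library (my choice, not mentioned in the answer): variables and constants become vertices, each fact becomes an edge, and since constants 2 and 3 are the only conflicting pair in this example, an ordinary s-t minimum edge cut already finds a fact to drop:

import networkx as nx

facts = [("x", 1), ("y", 2), ("z", 3), ("a", 1), ("b", 2), ("c", 3),
         ("a", "x"), ("b", "z"), ("c", "z")]

G = nx.Graph(facts)                                 # one vertex per variable/constant, one edge per fact
terminals = [v for v in G if isinstance(v, int)]    # the constants 1, 2, 3

# with more than two conflicting constants, separating every pair
# is the NP-hard multiway cut problem described above
cut = nx.minimum_edge_cut(G, 2, 3)
print(cut)                                          # e.g. {('b', 'z')} -- removing that fact suffices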
Prolog could serve as a convenient implementation language, but some thought about algorithms suggests that a specialized approach may be advantageous.
Among statements of these kinds (equalities between two variables or between one variable and one constant) the only inconsistency that can arise is a path connecting two distinct constants.
Thus, if we find all "inconsistency" paths that connect pairs of distinct constants, it is necessary and sufficient to find a set of edges (original equalities) that disconnect all of those paths.
It is tempting to think a greedy algorithm is optimal here: always select an edge to remove which is common to the largest number of remaining "inconsistency" paths. Thus I propose:
1) Find all simple paths P connecting two different constants (without passing through any third constant), and construct a linkage structure between these paths and their edges.
2) Count the frequencies of edges E appearing along those "inconsistency" paths P.
3) Find a sufficient number of edges to remove by pursuing a greedy strategy, removing the next edge that appears most frequently and updating the counts of edges in remaining paths accordingly.
4) Given that upper bound on edges necessary to remove (to leave a consistent subset of statements), apply a backtracking strategy to determine whether any smaller number of edges would suffice.
As applied to the example in the Question, there prove to be exactly two "inconsistency" paths:
2 -- b -- z -- 3
2 -- b -- z -- c -- 3
Removing either of the two edges, 2 -- b or b -- z, common to both of these paths suffices to disconnect both "inconsistency" paths (removing all inconsistencies among remaining statements).
Moreover it is evident that no other single edge's removal would suffice to accomplish that.
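Here is a Python sketch of steps 1-3 of the proposal (my own code and names; step 4's backtracking check is omitted), run on the facts from the question:

from collections import Counter

facts = [("x", 1), ("y", 2), ("z", 3), ("a", 1), ("b", 2), ("c", 3),
         ("a", "x"), ("b", "z"), ("c", "z")]

adj = {}
for u, v in facts:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)
constants = {v for v in adj if isinstance(v, int)}

def inconsistency_paths():
    # step 1: simple paths joining two distinct constants without passing
    # through a third constant, each stored as a frozenset of edges
    found = set()
    def walk(path):
        for nxt in adj[path[-1]]:
            if nxt in path:
                continue
            if nxt in constants:
                full = path + [nxt]
                found.add(frozenset(frozenset(e) for e in zip(full, full[1:])))
            else:
                walk(path + [nxt])
    for c in constants:
        walk([c])
    return found

paths = inconsistency_paths()
removed = []
while paths:                               # steps 2-3: greedy removal
    counts = Counter(e for p in paths for e in p)
    edge = counts.most_common(1)[0][0]     # edge hitting the most remaining paths
    removed.append(tuple(edge))
    paths = {p for p in paths if edge not in p}

print(removed)    # e.g. [('b', 'z')] -- one removed fact restores consistency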

Matrix tree data structure

Recently I heard from a professor at my university about a data structure called a "matrix tree". I can understand what it is, but where is it useful?
I'll try to give a short explanation of this structure:
We have a tree root - a special node. Then we have left and right "children" (subtrees), both of which are binary trees. If some numbers are missing from the tree but their "descendants" are present, we add the missing numbers as parasitic nodes (so the subtrees are nearly full). In the left subtree all nodes are even numbers; the others are in the right subtree. For a node we can say that N = 2^L * (2*Y - 1), where N is the node value (even, in this case), L is the level number and Y is the position within the level.
Example (even subtree):
8
/ \
4 12
/ \ /\
2 6 10 14
If we exclude, for example, 4, it becomes parasitic (a special flag in the node) and that's all.
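A tiny check of the formula in Python (my own illustration; it assumes levels are counted upward from the leaves, i.e. the leaves are level 1, which is what makes the example come out right):

def node_value(L, Y):
    # N = 2^L * (2*Y - 1) for the even (left) subtree
    return 2 ** L * (2 * Y - 1)

for L in (3, 2, 1):
    print([node_value(L, Y) for Y in range(1, 2 ** (3 - L) + 1)])
# [8]
# [4, 12]
# [2, 6, 10, 14]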

O(1) algorithm to determine if node is descendant of another node in a multiway tree?

Imagine the following tree:
A
/ \
B C
/ \ \
D E F
I'm looking for a way to query whether, for example, F is a descendant of A (note: F doesn't need to be a direct descendant of A), which in this particular case would be true. Only a limited number of potential parent nodes needs to be tested against a larger pool of potential descendant nodes.
When testing whether a node is a descendant of a node in the potential parent pool, it needs to be tested against ALL potential parent nodes.
This is what I came up with:
Convert the multiway tree to a trie, i.e. assign the following prefixes to the nodes in the above tree:
A = 1
B = 11
C = 12
D = 111
E = 112
F = 121
Then, reserve a bit array for every possible prefix size and add the parent nodes to be tested against, i.e. if C is added to the potential parent node pool, do:
1 2 3 <- Prefix length
*[1] [1] ...
[2] *[2] ...
[3] [3] ...
[4] [4] ...
... ...
When testing if a node is a descendant of a potential parent node, take its trie prefix, look up the first character in the first "prefix array" (see above) and, if it is present, look up the second prefix character in the second "prefix array" and so on, i.e. testing F leads to:
F = 1 2 1
*[1] [1] ...
[2] *[2] ...
[3] [3] ...
[4] [4] ...
... ...
so yes, F is a descendant of C.
This test seems to be worst case O(n), where n = maximum prefix length = maximum tree depth, so its worst case is exactly equal to the obvious way of just going up the tree and comparing nodes. However, this performs much better if the tested node is near the bottom of the tree and the potential parent node is somewhere at the top. Combining both algorithms would mitigate both worst case scenarios. However, memory overhead is a concern.
Is there another way for doing that? Any pointers greatly appreciated!
Are your input trees always static? If so, then you can use a Lowest Common Ancestor algorithm to answer the is-descendant question in O(1) time with an O(n) time/space construction. An LCA query is given two nodes and asked which is the lowest node in the tree whose subtree contains both nodes. Then you can answer the IsDescendant query with a single LCA query: if LCA(A, B) == A or LCA(A, B) == B, then one is a descendant of the other.
This Topcoder algorithm tutorial gives a thorough discussion of the problem and a few solutions at various levels of code complexity/efficiency.
I don't know if this would fit your problem, but one way to store hierarchies in databases, with quick "give me everything from this node and downwards" features is to store a "path".
For instance, for a tree that looks like this:
+-- b
|
a --+ +-- d
| |
+-- c --+
|
+-- e
you would store the rows as follows, assuming the letter in the above tree is the "id" of each row:
id path
a a
b a*b
c a*c
d a*c*d
e a*c*e
To find all descendants of a particular node, you would do a "STARTSWITH" query on the path column, i.e. all nodes with a path that starts with a*c*
To find out if a particular node is a descendant of another node, you would see if the longest path started with the shortest path.
So for instance:
e is a descendant of a since a*c*e starts with a
d is a descendant of c since a*c*d starts with a*c
Would that be useful in your instance?
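In code the test is just a string-prefix check; a minimal Python sketch of the tree above (the paths dict and is_descendant are my own names):

paths = {"a": "a", "b": "a*b", "c": "a*c", "d": "a*c*d", "e": "a*c*e"}

def is_descendant(node, ancestor):
    # node is a strict descendant of ancestor iff its path starts with
    # the ancestor's path plus the separator
    return paths[node].startswith(paths[ancestor] + "*")

print(is_descendant("e", "a"))   # True:  a*c*e starts with a*
print(is_descendant("d", "c"))   # True:  a*c*d starts with a*c*
print(is_descendant("b", "c"))   # False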
Traversing any tree requires "depth of tree" steps. Therefore, if you maintain a balanced tree structure, it is provable that you will need O(log n) operations for your lookup. From what I understand your tree looks special and you cannot maintain it in a balanced way, right? So O(n) is possible. But this is bad during creation of the tree anyway, so you will probably die before you ever use the lookup...
Depending on how often you need that lookup operation compared to inserts, you could decide to pay during insert to maintain an extra data structure. I would suggest hashing if you really need amortized O(1). On every insert operation you put all parents of a node into a hashtable. By your description this could be O(n) items on a given insert. If you do n inserts this sounds bad (towards O(n^2)), but actually your tree cannot degrade that badly, so you probably get an amortized overall hashtable size of O(n log n). (Actually, the log n part depends on the degree of degradation of your tree. If you expect it to be maximally degraded, don't do it.)
So, you would pay about O(log n) on every insert, and get hashtable efficiency O(1) for a lookup.
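A minimal Python sketch of that idea (the names are mine), paying at insert time so the lookup is a single set-membership test:

ancestors = {}   # node -> frozenset of all its ancestors

def insert(node, parent=None):
    # record the full ancestor set when the node joins the tree
    ancestors[node] = frozenset() if parent is None else ancestors[parent] | {parent}

def is_descendant(node, candidate_parent):
    return candidate_parent in ancestors[node]   # O(1) expected

insert("A"); insert("B", "A"); insert("C", "A")
insert("D", "B"); insert("E", "B"); insert("F", "C")
print(is_descendant("F", "A"))   # True
print(is_descendant("F", "B"))   # False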
For an M-way tree, instead of your bit array, why not just store the binary "trie id" (using M bits per level) with each node? For your example (assuming M==2): A=0b01, B=0b0101, C=0b1001, ...
Then you can do the test in O(1):
bool IsParent(node* child, node* parent)
{
    // each level contributes one set bit, so an ancestor's bits are a
    // subset of every descendant's bits
    return ((child->id & parent->id) == parent->id);
}
You could compress the storage to ceil(lg2(M)) bits per level if you have a fast FindMSB() function which returns the position of the most significant bit set:
mask = (1 << (FindMSB(parent->id) + 1)) - 1;   // keep only the parent's levels
return ((child->id & mask) == parent->id);
In a pre-order traversal, every set of descendants is contiguous. For your example,
A B D E C F
+---------+ A
+---+ B
+ D
+ E
+-+ C
+ F
If you can preprocess, then all you need to do is number each node and compute the descendant interval.
If you can't preprocess, then a link/cut tree offers O(log n) performance for both updates and queries.
You can answer queries of the form "Is node A a descendant of node B?" in constant time, using just two auxiliary arrays.
Preprocess the tree by visiting it in depth-first order, and for each node store its starting and ending time of the visit in the two arrays Start[] and End[].
So, let us say that End[u] and Start[u] are respectively the ending and starting time of the visit of node u.
Then node u is a descendant of node v if and only if:
Start[v] <= Start[u] and End[u] <= End[v].
and you are done; checking this condition requires just two lookups in the arrays Start and End.
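A short Python sketch of this preprocessing, using the example tree from the question (the dict and names are mine):

tree = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F"]}
start, end = {}, {}
clock = 0

def dfs(node):
    # record entry and exit times of the depth-first visit
    global clock
    start[node] = clock; clock += 1
    for child in tree.get(node, []):
        dfs(child)
    end[node] = clock; clock += 1

dfs("A")

def is_descendant(u, v):
    # u lies in v's subtree iff its visit interval nests inside v's
    return start[v] <= start[u] and end[u] <= end[v]

print(is_descendant("F", "A"))   # True
print(is_descendant("F", "B"))   # False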
Take a look at the Nested Set Model. It's very efficient for selects but too slow to update.
For what it's worth, what you're asking for here is equivalent to testing if a class is a subtype of another class in a class hierarchy, and in implementations like CPython this is just done the good old-fashioned "iterate the parents looking for the parent" way.
