I transformed a graph with cycles and multiple parents to XML such that I can use XQuery on it.
The graph is on the left and the XML-tree is on the right.
I transformed the graph by writing down all child nodes from the first node (node 1) and repeating that on the returned nodes until no more children exist or a node has already been visited (like node 2).
Furthermore, I added the constraint that all nodes with the same number have to be selected if one of them is selected. (For example, if node 2 (child of 1) is selected, then we also have to select node 2 (child of 6) in the XML-tree.)
The operations I can use on the graph are: getParents, getChildren, readValue(node).
In the graph, all information is stored in the node, and in the XML-tree all information of a node is stored as attributes.
My Question: I want to synchronize both structures such that I can apply an axis like ancestor (or descendant) on the graph and on the XML-tree and get the same result. (I can parse the graph with Python and the XML-tree with XQuery.)
My Problem: If I select node 8 on the graph and apply the ancestor function, it'll return: 4, 5, 2, 1, 6, 3 (6 and 3 because of the cycle).
The ancestor axis on the XML-tree would return (we have to select both 8s): 4, 5, 2, 1 (the second 2 (child of 6) would also be selected due to the constraint, but not nodes 6 and 3).
My Solution: Changing the ancestor axis such that it returns all parents of the selected nodes, then applies the constraint and then selects again all parents and so on. But this solution seems to be very complicated and inefficient. Is there any better way?
Thanks for your help
I think it is not that easy to solve for that particular format with XSLT/XQuery/XPath, as the document order imposed by most steps or by except or intersect, or the arbitrary order that XQuery grouping gives, makes it hard to establish the nodes you want and the order in which they are traversed. The easiest I could come up with is:
declare namespace output = "http://www.w3.org/2010/xslt-xquery-serialization";
declare option output:method 'text';
declare option output:item-separator ', ';
declare variable $main-root := /;
declare function local:eliminate-duplicates($nodes as node()*) as node()*
{
    for $node at $p in $nodes
    group by $id := generate-id($node)
    order by head($p)
    return head($node)
};
declare function local:get-parents($nodes as element(node)*, $collected as element(node)*) as element(node)*
{
    let $new-parents :=
        for $p in local:eliminate-duplicates($nodes ! ..)
        return $main-root//node[@value = $p/@value][not(. intersect $collected)]
    return
        if ($new-parents)
        then local:get-parents($new-parents, ($collected, $new-parents))
        else $collected
};
local:get-parents(//node[@value = 8], ()) ! @value ! string()
https://xqueryfiddle.liberty-development.net/gWmuPs8 gives 4, 5, 2, 2, 1, 6, 3.
How efficiently that works will partly depend on any index used for the node[@value = $p/@value] comparison; in XSLT you could ensure that with a key (https://xsltfiddle.liberty-development.net/aiyneS), in database-oriented XQuery processors probably with an attribute-based index.
A friend of mine was asked this question in an interview.
Given two binary trees, explain how you would create a diff such that if you have that diff and either of the trees you should be able to generate the other binary tree. Implement a function createDiff(Node tree1, Node tree2) that returns that diff.
Tree 1
4
/ \
3 2
/ \ / \
5 8 10 22
Tree 2
1
\
4
/ \
11 12
If you are given Tree 2 and the diff you should be able to generate Tree 1.
My solution:
Convert both binary trees into arrays where the left child is at 2n+1 and the right child is at 2n+2, and represent an empty node by -1. Then just do element-wise subtraction of the arrays to create the diff. This solution will fail if the tree has -1 as a node value, and I think there has to be a better and neater solution, but I'm not able to figure it out.
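A rough Python sketch of that encoding (representing nodes as (value, left, right) tuples purely for illustration):
def to_array(node, i=0, arr=None, size=15):
    # node is (value, left, right) or None; -1 marks an empty slot
    if arr is None:
        arr = [-1] * size
    if node is not None:
        value, left, right = node
        arr[i] = value
        to_array(left, 2 * i + 1, arr)
        to_array(right, 2 * i + 2, arr)
    return arr

tree1 = (4, (3, (5, None, None), (8, None, None)), (2, (10, None, None), (22, None, None)))
tree2 = (1, None, (4, (11, None, None), (12, None, None)))
diff = [a - b for a, b in zip(to_array(tree1), to_array(tree2))]
# tree1 can then be recovered from tree2 and diff by element-wise addition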
Think of them as directory trees and print a sorted list of the paths to every leaf item.
Tree 1 becomes:
4/2/10
4/2/22
4/3/5
4/3/8
These list formats can be diff'ed and the tree recreated from such a list.
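A small Python sketch of that idea (nodes as (value, left, right) tuples are just an assumed representation):
def leaf_paths(node, prefix=""):
    # node is (value, left, right) or None
    if node is None:
        return []
    value, left, right = node
    path = str(value) if not prefix else prefix + "/" + str(value)
    below = leaf_paths(left, path) + leaf_paths(right, path)
    return below if below else [path]

tree1 = (4, (3, (5, None, None), (8, None, None)), (2, (10, None, None), (22, None, None)))
print(sorted(leaf_paths(tree1)))   # ['4/2/10', '4/2/22', '4/3/5', '4/3/8']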
There are many ways to do this.
I would suggest that you turn the tree into a sorted array of triples of (parent, child, direction). So start with tree1:
4
/ \
3 2
/ \ / \
5 8 10 22
This quickly becomes:
(None, 4, None) # top
(4, 3, L)
(3, 5, L)
(3, 8, R)
(4, 2, R)
(2, 10, L)
(2, 22, R)
Which you sort to get
(None, 4, None) # top
(2, 10, L)
(2, 22, R)
(3, 5, L)
(3, 8, R)
(4, 2, R)
(4, 3, L)
Do the same with the other, and then diff them.
Given a tree and the diff, you can first turn the tree into this form, look at the diff, realize which direction it is and get the desired representation with patch. You can then reconstruct the other tree recursively.
The reason why I would do it with this representation is that if the two trees share any subtrees in common - even if they are placed differently in the main tree - those will show up in common. And therefore you are likely to get relatively small diffs if the trees do, in fact, match in some interesting way.
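As a rough Python sketch of building and sorting those triples (the (value, left, right) tuple representation is just an assumption for illustration):
def to_triples(node, parent=None, direction=None):
    # node is (value, left, right) or None
    if node is None:
        return []
    value, left, right = node
    return ([(parent, value, direction)]
            + to_triples(left, value, "L")
            + to_triples(right, value, "R"))

tree1 = (4, (3, (5, None, None), (8, None, None)), (2, (10, None, None), (22, None, None)))
triples = to_triples(tree1)
top, rest = triples[0], sorted(triples[1:])   # keep the (None, 4, None) entry on top
for t in [top] + rest:
    print(t)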
Edit
Per a point from @ruakh, this does assume that values do not repeat in a tree. If they do, then you could use a representation like this:
4
/ \
3 2
/ \ / \
5 8 10 22
becomes
(, 4)
(0, 3)
(00, 5)
(01, 8)
(1, 2)
(10, 10)
(11, 22)
And now if you move subtrees, they will show up as large diffs. But if you just change one node, it will still be a small diff.
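A quick Python sketch of that position-based listing (again assuming (value, left, right) tuples):
def positions(node, path=""):
    # node is (value, left, right) or None; path records left/right turns from the root
    if node is None:
        return []
    value, left, right = node
    return [(path, value)] + positions(left, path + "0") + positions(right, path + "1")

tree1 = (4, (3, (5, None, None), (8, None, None)), (2, (10, None, None), (22, None, None)))
print(positions(tree1))   # [('', 4), ('0', 3), ('00', 5), ('01', 8), ('1', 2), ('10', 10), ('11', 22)]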
(The example from the question/interview is not very helpful in that it does not show any shared sub-structure of non-trivial size. Then again, the interview question may be outstanding for initiating a dialogue between customer and developer.)
Re-use of subtrees needs a representation that allows identifying them. It seems useful to be able to reconstruct the smaller tree without walking most of the difference. Denoting the "definition" of identifiable sub-trees with capital letters and their re-use by a tacked-on ':
d e d--------e
c b "-" c b => C B' C' b
b a a b a a B a a
a a a
(The problem statement does not say diff is linear.)
Things to note:
there's a sub-tree B occurring in two places of T1
in T2, there's another b with one leaf-child a that is not another occurrence of B
no attempt to share leaves
What if now I imagine (or the interviewer suggests) two huge trees, identical but for one node somewhere in the middle which has a different value?
Well, at least its sub-trees will be shared, and "the other sub-trees" all the way up to the root. Too bad if the trees are degenerated and almost all nodes are part of that path.
Huge trees with children of the root exchanged?
(Detecting trees occurring more than once has a chance to shine here.)
The bigger problem would seem to be the whole trees represented in "the diff", while the requirement may be
Given one tree, the diff shall support reconstruction of the other using little space and processing.
(It might include that setting up the diff shall be cheap, too - which I'd immediately challenge: a small diff looks related to editing distance.)
A way to identify "crucial nodes" in each tree is needed - btilly's suggestion of a "left-right-string" is as good as gold.
Then, one would need a way to keep differences in children & value.
That's the far end I'd expect an exchange in an interview to reach.
To detect re-used trees, I'd add the height to each internal node. For a proof of principle, I'd probably use an existing implementation of find repeated strings on a suitable serialisation.
There are many ways to think of a workable diff-structure.
Naive solution
One naive way is to store the two trees in a tuple. Then, when you need to regenerate a tree, given the other and the diff, you just look for a node that is different when comparing the given tree with the tree in the first tuple entry of the diff. If found you return that tree from the first tuple entry. If not found, you return the second one from the diff tuple.
Small diffs for small differences
An interviewer would probably ask for a less memory consuming alternative. One could try to think of a structure that will be small in size when there are only a few values or nodes different. In the extreme case where both trees are equal, such diff would be (near-)empty as well.
Definitions
I define these terms before defining the diff's structure:
Imagine the trees get extra NIL leaf nodes, i.e. an empty tree would consist of one NIL node, and a tree with only a root node would have two NIL nodes as its direct children, etc.
A node is common to both trees when it can be reached via the same path from the root (e.g. left-left-right), irrespective of whether they contain the same value or have the same children. A node can even be common when it is a NIL node in one or both of the trees (as defined above).
Common nodes (including NIL nodes when they are common) get a preorder sequence number (0, 1, 2, ...). Nodes that are not common are discarded during this numbering.
Diff structure
The difference could be a list of tuples, where each tuple has this information:
The above mentioned preorder sequence number, identifying a common node
A value: when neither node is a NIL node, this is the diff of the values (e.g. XOR). When one of the nodes is a NIL node, the value is the other node object (so effectively including the whole subtree below it). In typeless languages, either kind of information can fit in the same tuple position. In strongly typed languages, you would use an extra entry in the tuple (e.g. atomicValue, subtree), where only one of the two would have a significant value.
A tuple is only added for a common node, and only when the values differ or exactly one of the two nodes is a NIL node.
Algorithm
The diff can be created via a preorder walk through the common nodes of the trees.
Here is an implementation in JavaScript:
class Node {
  constructor(value, left, right) {
    this.value = value;
    if (left) this.left = left;
    if (right) this.right = right;
  }
  clone() {
    return new Node(this.value, this.left ? this.left.clone() : undefined,
                                this.right ? this.right.clone() : undefined);
  }
}

// Main functions:
function createDiff(tree1, tree2) {
  let i = -1; // preorder sequence number
  function recur(node1, node2) {
    i++;
    if (!node1 !== !node2) return [[i, (node1 || node2).clone()]];
    if (!node1) return [];
    const result = [];
    if (node1.value !== node2.value) result.push([i, node1.value ^ node2.value]);
    return result.concat(recur(node1.left, node2.left), recur(node1.right, node2.right));
  }
  return recur(tree1, tree2);
}

function applyDiff(tree, diff) {
  let i = -1; // preorder sequence number
  let j = 0;  // index in diff array
  function recur(node) {
    i++;
    let diffData = j >= diff.length || diff[j][0] !== i ? 0 : diff[j++][1];
    if (diffData instanceof Node) return node ? undefined : diffData.clone();
    return node && new Node(node.value ^ diffData, recur(node.left), recur(node.right));
  }
  return recur(tree);
}
// Create sample data:
let tree1 =
  new Node(4,
    new Node(3,
      new Node(5), new Node(8)
    ),
    new Node(2,
      new Node(10), new Node(22)
    )
  );
let tree2 =
  new Node(2,
    undefined,
    new Node(4,
      new Node(11), new Node(12)
    )
  );
// Demo:
let diff = createDiff(tree1, tree2);
console.log("Diff:");
console.log(diff);
const restoreTree2 = applyDiff(tree1, diff);
console.log("Is restored second tree equal to original?");
console.log(JSON.stringify(tree2)===JSON.stringify(restoreTree2));
const restoreTree1 = applyDiff(tree2, diff);
console.log("Is restored first tree equal to original?");
console.log(JSON.stringify(tree1)===JSON.stringify(restoreTree1));
const noDiff = createDiff(tree1, tree1);
console.log("Diff for two equal trees:");
console.log(noDiff);
So I am not asking about the diagonal view of a tree, which fortunately I already know. I am asking: if I view a tree from a 45-degree angle, only a few nodes should be visible. So there is a plane which is at an angle of 45 degrees from the x-axis, and we need to print all the nodes which are visible from that plane.
For example:
1
/ \
2 3
/ \ / \
4 5 6 7
So if I look from that plane, I will only see nodes [4, 6, 7], as 5 and 6 overlap each other. If I add another node at 6, it will then hide 7. How do I do that? I searched on the internet but couldn't find the answer.
Thanks!
I am giving you an abstract answer as the question is not language specific.
The problem with logging trees like this is the use of recursion.
By that I mean the traversal is going down nodes and up nodes.
What if you wrote a height helper which returned the depth of the current node?
For each depth level, you place the value in an array.
Then, write the values of the array.
Then you could grab the length of the last array and determine the amount of spaces each node needs.
Allow the arrays to hold empty values, or else you will have to keep track of which nodes don't have children.
int total_depth = tree.getTotalHeight();
// one array of node values per depth level
array arr[total_depth];
for (int i = 0; i < total_depth; i++) {
    // there is a formula for the max number of nodes at a given depth of a binary tree (2^i)
    arr[i] = new array(maximum_nodes_at_depth);
}
tree.inorderTraverse(function(node) {
    int depth = node.getHeightHelper();
    // check if the item is null
    if (node != nullptr && node.Item != NULL)
    {
        arr[depth].push(node.Item);
    }
    else
    {
        arr[depth].push(NULL);
    }
});
So now you would have to calculate the size of your tree and then dynamically calculate how many spaces should prefix each node. The lower the depth the more prefixed spaces to center it.
I apologize but the pseudocode is a mix of javascript and c++ syntax.... which should never happen lol
In a generic tree represented by the common node structure having parent and child pointers, how can one find a list of all paths that have no overlapping edges with each other and terminate at a leaf node?
For example, given a tree like this:
1
/ | \
2 3 4
/ \ | / \
5 6 7 8 9
The desired output would be a list of paths as follows:
1 2 1 1 4
| | | | |
2 6 3 4 9
| | |
5 7 8
Or in list form:
[[1, 2, 5], [2, 6], [1, 3, 7], [1, 4, 8], [4, 9]]
Obviously the path lists themselves and their order can vary based on the order of processing of tree branches. For example, the following is another possible solution if we process left branches first:
[[1, 4, 9], [4, 8], [1, 3, 7], [1, 2, 6], [2, 5]]
For the sake of this question, no specific order is required.
You can use a recursive DFS algorithm with some modifications.
You didn't say what language you use, so I hope that C# is OK for you.
Let's define a class for our tree node:
public class Node
{
    public int Id;
    public bool UsedOnce = false;
    public bool Visited = false;
    public Node[] Children;
}
Take a look at the UsedOnce variable - it can look pretty ambiguous.
UsedOnce equals true if this node has been used once in an output. Since we have a tree, it also means that the edge from this node to its parent has been used once in an output (in a tree, every node has only one parent edge, which is the edge to its parent). Read this carefully so as not to become confused later.
Here we have a simple, basic depth-first search algorithm implementation.
All the magic will be covered in an output method.
List<Node> currentPath = new List<Node>(); // list of visited nodes

public void DFS(Node node)
{
    if (node.Children.Length == 0) // if it is a leaf (no children) - output
    {
        OutputAndMarkAsUsedOnce(); // Here goes the magic...
        return;
    }
    foreach (var child in node.Children)
    {
        if (!child.Visited) // for every not visited child call DFS
        {
            child.Visited = true;
            currentPath.Add(child);
            DFS(child);
            currentPath.Remove(child);
            child.Visited = false;
        }
    }
}
If OutputAndMarkAsUsedOnce just output the currentPath contents, then we would have a plain DFS output like this:
1 2 5
1 2 6
1 3 7
1 4 8
1 4 9
Now, we need to use our UsedOnce. Let's find the last used-once node (one which has already been in an output) in the current path and output the path from this node inclusively (or the whole path if nothing is marked yet). It is guaranteed that there is something to output because at least the last node in the path has never been met before and couldn't have been marked as used once.
For instance, if the current path is "1 2 3 4 5" and 1, 2, 3 are marked as used once - then output "3 4 5".
In your example:
We are at "1 2 5". All of them are unused, output "1 2 5" and mark 1, 2, 5 as used once
Now, we are at "1 2 6". 1, 2 are used - 2 is the last one. Output from 2 inclusively, "2 6", mark 2 and 6 as used.
Now we are at "1 3 7", 1 is used, the only and the last. Output from 1 inclusively, "1 3 7". Mark 1, 3, 7 as used.
Now we are at "1 4 8". 1 is used, the only and the last. Output "1 4 8".
Now we are at "1 4 9". 1, 4 are used. Output from 4 - "4 9".
It works because in a tree "used node" means "used (the only parent) edge between it and its parent". So, we actually mark used edges and do not output them again.
For example, when we mark 2, 5 as used - it means that we mark edges 1-2 and 2-5. Then, when we go for "1 2 6" - we don't output edges "1-2" because it is used, but output "2-6".
Marking root node (node 1) as used once doesn't affect the output because its value is never checked. It has a physical explanation - root node has no parent edge.
Sorry for a poor explanation. It is pretty difficult to explain an algorithm on trees without drawing :) Feel free to ask any questions concerning algorithms or C#.
Here is the working IDEOne demo.
P.S. This code is, probably, not a good and proper C# code (avoided auto-properties, avoided LINQ) in order to make it understandable to other coders.
Of course, this algorithm is not perfect - we can remove currentPath because in a tree the path is easily recoverable; we can improve output; we can encapsulate this algorithm in a class. I just have tried to show the common solution.
This is a tree. The other solutions probably work but are unnecessarily complicated. Represent a tree structure in Python.
class Node:
    def __init__(self, label, children):
        self.label = label
        self.children = children
Then the tree
1
/ \
2 3
/ \
4 5
is Node(1, [Node(2, []), Node(3, [Node(4, []), Node(5, [])])]). Make a recursive procedure as follows. We guarantee that the root appears in the first path.
def disjointpaths(node):
    if node.children:
        paths = []
        for child in node.children:
            childpaths = disjointpaths(child)
            childpaths[0].insert(0, node.label)
            paths.extend(childpaths)
        return paths
    else:
        return [[node.label]]
This can be optimized (first target: stop inserting at the front of a list).
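For example, on the tree from the question this produces one of the accepted outputs:
tree = Node(1, [Node(2, [Node(5, []), Node(6, [])]),
                Node(3, [Node(7, [])]),
                Node(4, [Node(8, []), Node(9, [])])])
print(disjointpaths(tree))   # [[1, 2, 5], [2, 6], [1, 3, 7], [1, 4, 8], [4, 9]]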
For all vertices, if the vertex is a leaf (has no child pointers), go through the parent chain until you find a marked vertex or a vertex with no parent. Mark all visited vertices. Collect the vertices into an intermediate list, then reverse it and add it to the result.
If you cannot add a mark to the vertex object itself, you may implement the marking as a separate set of visited vertices and consider all the vertices added to the set as marked.
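A minimal Python sketch of this, assuming each vertex object has label and parent attributes (parent is None at the root) and that the leaves are given:
class V:
    def __init__(self, label, parent=None):
        self.label, self.parent = label, parent

def paths_by_marking(leaves):
    visited = set()                  # marking kept outside the vertex objects
    result = []
    for leaf in leaves:
        path, v = [], leaf
        while v is not None and v not in visited:
            visited.add(v)
            path.append(v.label)
            v = v.parent
        if v is not None:            # stopped at an already marked vertex: include it once more
            path.append(v.label)
        path.reverse()
        result.append(path)
    return result

r = V(1); a, b, c = V(2, r), V(3, r), V(4, r)
leaves = [V(5, a), V(6, a), V(7, b), V(8, c), V(9, c)]
print(paths_by_marking(leaves))      # [[1, 2, 5], [2, 6], [1, 3, 7], [1, 4, 8], [4, 9]]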
This can be very easily accomplished using DFS.
We call the DFS from root.
DFS(root,list)
where the list initially contains
list = {root}
Now the algorithm is as follows:
DFS(ptr, list)
{
    if (ptr is a leaf)
        print the list and return
    else
    {
        for the ith child of ptr do
        {
            if (ptr is root)
            {
                add the child to list
                DFS(ith child of ptr, list)
                restore list to contain only the root   // undo everything appended in that branch
            }
            else if (i equals 1, that is, the first child)
            {
                add the child to list
                DFS(ith child of ptr, list)
            }
            else
            {
                initialize a new empty list list2
                add the ptr node and the ith child to list2
                DFS(ith child of ptr, list2)
            }
        }
    }
}
I have the edges and I want to build a tree with them.
The problem is that I can construct my tree structure only when the edges are in a specific order.
Example of orders:
(vertex, parent_vertex)
good:           bad:
(0, )  <-top    (3, 2)
(1, 0)          (1, 0)
(2, 1)          (2, 1)
(3, 2)          (0, )  <-top
I iterate through the edges and, for the current vertex, try to find its parent in the already created tree; then I construct the node and insert it.
result tree:
0 - 1 - 2 - 3
So a parent must always exist in the tree for the newly added vertex.
The question is how to sort the input edges. Something tells me topological sort, but that is for vertices. Is it possible to sort them correctly?
@mirt thanks for pointing out the optimizations to my approach, have you got a better one?
I will put the algorithm below for reference.
Initially, construct a hash map H to store the elements that are already in the tree; add the root (null in your case, or anything that represents the root).
Taking each pair as (_child, _parent), loop through the whole list.
For each pair, see if _child and _parent are in the hash map H. If you don't find them, create tree nodes for the missing ones, add them to H, and link them with the parent-child relationship.
You will be left with the tree at the end of the iteration.
The complexity is O(n).
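A small Python sketch of this idea, assuming (child, parent) pairs with None marking the root:
def build_tree(pairs):
    nodes = {}                                   # this is the hash map H
    def get(label):
        if label not in nodes:
            nodes[label] = {"label": label, "children": []}
        return nodes[label]
    root = None
    for child, parent in pairs:
        node = get(child)
        if parent is None:
            root = node                          # the root has no parent
        else:
            get(parent)["children"].append(node)
    return root

# works regardless of the order of the edges:
root = build_tree([(3, 2), (1, 0), (2, 1), (0, None)])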
In my particular case, the graph is represented as an adjacency list and is undirected and sparse, n can be in the millions, and d is 3. Calculating A^d (where A is the adjacency matrix) and picking out the non-zero entries works, but I'd like something that doesn't involve matrix multiplication. A breadth-first search on every vertex is also an option, but it is slow.
def find_d(graph, start, st, d=0):
    if d == 0:
        st.add(start)
    else:
        st.add(start)
        for edge in graph[start]:
            find_d(graph, edge, st, d-1)
    return st
graph = { 1 : [2, 3],
2 : [1, 4, 5, 6],
3 : [1, 4],
4 : [2, 3, 5],
5 : [2, 4, 6],
6 : [2, 5]
}
print find_d(graph, 1, set(), 2)
Let's say that we have a function verticesWithin(d,x) that finds all vertices within distance d of vertex x.
One good strategy for a problem such as this, to expose caching/memoisation opportunities, is to ask the question: How are the subproblems of this problem related to each other?
In this case, we can see that verticesWithin(d,x) for d >= 1 is the union of verticesWithin(d-1, y[i]) for all i within range, where y = verticesWithin(1,x). If d == 0 then it's simply {x}. (I'm assuming that a vertex is deemed to be at distance 0 from itself.)
In practice you'll want to look at the adjacency list for the case d == 1, rather than using that relation, to avoid an infinite loop. You'll also want to avoid the redundancy of considering x itself as a member of y.
Also, if the return type of verticesWithin(d,x) is changed from a simple list or set, to a list of d sets representing increasing distance from x, then
verticesWithin(d,x) = init(verticesWithin(d+1,x))
where init is the function that yields all elements of a list except the last one. Obviously this would be a non-terminating recursive relation if transcribed literally into code, so you have to be a little bit clever about how you implement it.
Equipped with these relations between the subproblems, we can now cache the results of verticesWithin, and use these cached results to avoid performing redundant traversals (albeit at the cost of performing some set operations - I'm not entirely sure that this is a win). I'll leave it as an exercise to fill in the implementation details.
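As a rough illustration of the basic relation (not the layered variant), here is a memoised Python sketch that reads the adjacency list for the d == 1 case, as noted above:
from functools import lru_cache

graph = {1: [2, 3], 2: [1, 4, 5, 6], 3: [1, 4], 4: [2, 3, 5], 5: [2, 4, 6], 6: [2, 5]}

@lru_cache(maxsize=None)
def vertices_within(d, x):
    if d == 0:
        return frozenset({x})
    if d == 1:
        return frozenset(graph[x]) | {x}          # adjacency list, plus x itself
    result = set(vertices_within(1, x))
    for y in graph[x]:                            # only the neighbours, so x is not re-expanded
        result |= vertices_within(d - 1, y)
    return frozenset(result)

print(sorted(vertices_within(2, 1)))              # [1, 2, 3, 4, 5, 6]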
You already mention the option of calculating A^d, but this is much, much more than you need (as you already remark).
There is, however, a much cheaper way of using this idea. Suppose you have a (column) vector v of zeros and ones, representing a set of vertices. The vector w := A v now has a one at every node that can be reached from the starting node in exactly one step. Iterating, u := A w has a one for every node you can reach from the starting node in exactly two steps, etc.
For d=3, you could do the following (MATLAB pseudo-code):
v = j'th unit vector
w = v
for i = (1:d)
    v = A*v
    w = w + v
end
The vector w now has a positive entry for each node that can be reached from the jth node in at most d steps.
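A rough Python/SciPy sketch of the same loop, using the example graph from the question (nodes 1..6 mapped to indices 0..5):
import numpy as np
from scipy.sparse import csr_matrix

edges = [(0, 1), (0, 2), (1, 3), (1, 4), (1, 5), (2, 3), (3, 4), (4, 5)]
rows = [i for i, j in edges] + [j for i, j in edges]
cols = [j for i, j in edges] + [i for i, j in edges]
A = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(6, 6))

d, j = 3, 0
v = np.zeros(6)
v[j] = 1                      # j'th unit vector
w = v.copy()
for _ in range(d):
    v = A @ v                 # nodes reachable in exactly one more step
    w = w + v
print(np.nonzero(w)[0])       # indices with a positive entry are within d steps of node j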
Breadth-first search starting with the given vertex is an optimal solution in this case. You will find all the vertices that are within distance d, and you will never even visit any vertices at distance >= d + 2.
Here is recursive code, although recursion can be easily done away with if so desired by using a queue.
// Returns the set of nodes within distance d of x
Set<Node> getNodesWithinDist(Node x, int d)
{
    Set<Node> s = new HashSet<Node>(); // our return value
    s.add(x);                          // x is within distance d of itself
    if (d > 0) {
        for (Node y : adjList(x)) {
            s.addAll(getNodesWithinDist(y, d - 1));
        }
    }
    return s;
}
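For completeness, a queue-based (non-recursive) sketch in Python, run on the adjacency-list graph from the question above:
from collections import deque

def nodes_within_dist(graph, start, d):
    dist = {start: 0}                       # also serves as the visited set
    queue = deque([start])
    while queue:
        x = queue.popleft()
        if dist[x] == d:                    # do not expand beyond the distance bound
            continue
        for y in graph[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return set(dist)

graph = {1: [2, 3], 2: [1, 4, 5, 6], 3: [1, 4], 4: [2, 3, 5], 5: [2, 4, 6], 6: [2, 5]}
print(nodes_within_dist(graph, 1, 2))       # {1, 2, 3, 4, 5, 6}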