Algorithm for searching for patterns in trees

I am working on a project that heavily uses a tree structure for data processing. I am looking for a method to find matching patterns in the tree. For example, consider a tree like:
(1:a) ----- (2:b) ---- (4:c) ---- (5:e) ---- (8:c) ---- (9:f)
  |---- (3:d)            |--- (6:f)            |--- (10:g)
                         |--- (7:g)
(1 has two children, 2 and 3; 4 has children 5, 6, and 7; and 8 has children 9 and 10.) The letters are the values of each node.
I need to find all the occurrences of something like
c ---- f
|--- g
which should return 4 and 8 as the indexes of the parent nodes. What is a good algorithm for that? It is probably BFS, but is there a more specialized search algorithm for this kind of search?

This is some theory-crafting on my part, so feel free to correct me where I am wrong.
It is influenced by the prefix/suffix trie structure, which lets you find matching substrings in a string. Although the data structure I chose is tree-like, it is also very graph-like in nature, because nodes are connected by references.
The output should ultimately show, quickly, the indexes of all sub-tree roots that contain the pattern.
The data structure I decided to use is similar to a tree node: it contains the string value, the index of every location where that value occurs, the index of the parent of each of those locations, and its children, stored in a map for O(1) best-case lookup.
All of the following code is in C#.
public class Node
{
    public String value; // This will be the value, e.g. "a"
    public Dictionary<int, int> connections; // holds {int reference (key), int parent (value)} pairs
    public Dictionary<String, Node> childs; // contains all childs, keyed by their value

    public Node()
    {
        connections = new Dictionary<int, int>();
        childs = new Dictionary<String, Node>();
    }
}
Second, we assume that your base data is a very traditional tree structure, although there may be a few differences.
public class TreeNode
{
public int index;
public String value;
public List<TreeNode> childs;
public TreeNode()
{
childs = new List<TreeNode>();
}
public TreeNode(String value)
{
childs = new List<TreeNode>();
this.value = value;
}
public void add(String name)
{
TreeNode child = new TreeNode(name);
childs.Add(child);
}
}
Finally, all nodes of the base TreeNode structure are indexed in breadth-first order (your example uses a 1-based index, but the following uses a 0-based index):
int index = 0;
Queue<TreeNode> tempQ = new Queue<TreeNode>();
tempQ.Enqueue(root);
while (tempQ.Count > 0)
{
    TreeNode temp = tempQ.Dequeue();
    temp.index = index;
    index++;
    foreach (TreeNode tn in temp.childs)
    {
        tempQ.Enqueue(tn);
    }
}
return root;
After we initialize our structure, assuming that the base data is stored in a traditional TreeNode structure, we will do three things:
1. Build a graph-like structure from the base TreeNode.
2. Represent each unique value with only ONE node. For example, {C}, {F}, and {G} from your example will each be represented by one node instead of two. (Simply stated, all nodes with a common value are grouped into one.)
3. Attach all unique nodes (from step 2) to the root element, and "rebuild" the tree by connecting references to references. (A graphic representation is shown below.)
Here is the C# code to build the structure, which runs in O(n):
private Node convert(TreeNode root)
{
Node comparisonRoot = new Node(); //root of our new comparison data structure.
//this main root will contain no data except
//for childs inside its child map, which will
//contain all childs with unique values.
TreeNode dataNode = root; //Root of base data.
Node workingNode = new Node(); //workingNode is our data structure's
//copy of the base data tree's root.
workingNode.value = root.value;
workingNode.connections.Add(0, -1);
// add workingNode to our data structure, because workingNode.value
// is currently unique to the empty map of the root's child.
comparisonRoot.childs.Add(workingNode.value, workingNode);
Stack<TreeNode> s = new Stack<TreeNode>();
s.Push(dataNode); //Initialize stack with root.
while (s.Count > 0) { //Iteratively traverse the tree using a stack
TreeNode temp = s.Pop();
foreach(TreeNode tn in temp.childs) {
//fill stack with childs
s.Push(tn);
}
//update workingNode to be the "parent" of the upcoming childs.
workingNode = comparisonRoot.childs[temp.value];
foreach(TreeNode child in temp.childs) {
if(!comparisonRoot.childs.ContainsKey(child.value)) {
//if value of the node is unique
//create a new node for the unique value
Node tempChild = new Node();
tempChild.value = child.value;
//store the reference/parent pair
tempChild.connections.Add(child.index, temp.index);
//because we are working with a unique value that first appeared,
//add the node to the parent AND the root.
workingNode.childs.Add(tempChild.value, tempChild);
comparisonRoot.childs.Add(tempChild.value, tempChild);
} else {
//if value of node is not unique (it already exists within our structure)
//update values, no need to create a new node.
Node tempChild = comparisonRoot.childs[child.value];
tempChild.connections.Add(child.index, temp.index);
if (!workingNode.childs.ContainsKey(tempChild.value)) {
workingNode.childs.Add(tempChild.value, tempChild);
}
}
}
}
return comparisonRoot;
}
All unique values are attached to a non-valued root, just for the purposes of using this root node as a map to quickly jump to any reference. (Shown below)
Here, you can see that all connections are made based on the original example tree, except that there is only one node instance for each unique value. Finally, you can see that all of the nodes are also connected to the root.
The whole point is that there is only one real Node object for each unique value, and it points to all possible connections by holding references to other nodes as childs. It is like a graph structure with a root.
Each Node will contain all pairs of {[index], [parent index]}.
Here is a string representation of this data structure:
Childs { A, B, D, C, E, F, G }
Connections { A=[0, -1]; B=[1, 0]; D=[2, 0]; C=[3, 1][7, 4];
E=[4, 3]; F=[5, 3][8, 7]; G=[6, 3][9, 7] }
Here, the first thing you may notice is that node A, which has no true parent in your example, has -1 for its parent index. This simply states that node A has no parent and is the root.
Another thing you may notice is that C has index values 3 and 7, which are connected to 1 and 4 respectively, that is, to node B and node E (check your example if this doesn't make sense).
Hopefully this was a good explanation of the structure.
So why did I decide to use this structure, and how does it help find the indexes of the nodes that match a certain pattern?
Similar to suffix tries, I thought that the most elegant solution would return all "successful searches" in a single operation, rather than traversing all nodes to check whether each one is a successful search (brute force).
So here is how the search will work.
Say we have the pattern
c ---- f
|--- g
from the example.
1. In a recursive approach, leaves simply return all possible parentIndex values (retrieved from our [index, parentIndex] pairs).
2. Afterwards, in a natural DFS type of traversal, C receives the return values of both F and G.
3. Here, we do an intersection (AND) operation on the results of all the childs to see which parentIndex values the sets have in common.
4. Next, we do another AND operation, this time between the result of the previous step and all possible indexes of C (our current branch).
5. By doing so, we now have the set of all indexes of C that contain both G and F.
Although that pattern is only 2 levels deep, for a deeper pattern we simply take the result set of C indexes, look up the parent of each result index in our [index, parentIndex] map, return that set of parentIndexes, and go back to step 2 of this method. (See the recursion?)
Here is the C# implementation of what was just explained.
private HashSet<int> search(TreeNode pattern, Node graph, bool isRoot)
{
if (pattern.childs.Count == 0)
{
//We are at a leaf, return the set of parents values.
HashSet<int> set = new HashSet<int>();
if (!isRoot)
{
//If we are not at the root of the pattern, we return the possible
//index of parents that can hold this leaf.
foreach (int i in graph.connections.Keys)
{
set.Add(graph.connections[i]);
}
}
else
{
//However if we are at the root of the pattern, we don't want to
//return the index of parents. We simply return all indexes of this leaf.
foreach (int i in graph.connections.Keys)
{
set.Add(i);
}
}
return set;
}
else
{
//We are at a branch. We recursively call this method to the
//leaves.
HashSet<int> temp = null;
foreach(TreeNode tn in pattern.childs) {
String value = tn.value;
//check if our structure has a possible connection with the next node down the pattern.
//return empty set if connection not found (pattern does not exist)
if (!graph.childs.ContainsKey(value)){
temp = new HashSet<int>();
return temp;
}
Node n = graph.childs[value];
//Simply recursively call this method to the leaves, and
//we do an intersection operation to the results of the
//recursive calls.
if (temp == null)
{
temp = search(tn, n, false);
}
else
{
temp.IntersectWith(search(tn, n, false));
}
}
//Now that we have the result of the intersection of all the leaves,
//we do a final intersection with the result and the current branch's
//index set.
temp.IntersectWith(graph.connections.Keys);
//Now we have all possible indexes. we have to return the possible
//parent indexes.
if (isRoot)
{
//However if we are at the root of the pattern, we don't want to
//return the parent index. We return the result of the intersection.
return temp;
}
else
{
//But if we are not at the root of the pattern, we return the possible
//index of parents.
HashSet<int> returnTemp = new HashSet<int>();
foreach (int i in temp)
{
returnTemp.Add(graph.connections[i]);
}
return returnTemp;
}
}
}
To call this method:
//pattern - root of the pattern, TreeNode object
//root - root of our generated structure, which was made with the convert() method
//boolean - a helper boolean just so the final calculation will return its
// own index as a result instead of its parent's indices
HashSet<int> answers = search(pattern, root.childs[pattern.value], true);
Phew, that was a long answer, and I'm not even sure if this is as efficient as other algorithms out there! I am also sure that there may be more efficient and elegant ways to search for a subtree inside a larger tree, but this was a method that came into my head! Feel free to leave any criticism, advice, edit, or optimize my solution :)

Idea 1
One simple way to improve the speed is to precompute a map from each letter to a list of all the locations in the tree where that letter occurs.
So in your example, c would map to [4,8].
Then when you search for a given pattern, you will only need to explore subtrees which have at least the first element correct.
Idea 2
An extension to this that might help for certain usage patterns is to also precompute a second map from each letter to a list of the parents of all locations in the tree where that letter occurs.
So for example, f would map to [4,8] and e to [4].
If the lists of locations are stored in sorted order then these maps can be used to efficiently find patterns with a head and certain children.
We get a list of possible locations by using the first map to look up the head, and additional lists by using the second map to look up the children.
You can then merge these lists (this can be done efficiently because the lists are sorted) to find entries that appear in every list - these will be all the matching locations.
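Ideas 1 and 2 can be sketched roughly as follows. This is only an illustration under some assumptions: the node shape `{ index, value, children }` and the names `buildMaps`, `intersectSorted`, and `findPattern` are made up for this example, and the pattern is limited to one head plus a set of distinct child values (a pattern that needs two children with the same value would require extra bookkeeping).

```javascript
// Assumed node shape (hypothetical): { index, value, children: [...] }.
function buildMaps(root) {
  const locations = new Map();       // value -> indexes where the value occurs (Idea 1)
  const parentLocations = new Map(); // value -> indexes of the parents of those nodes (Idea 2)
  const add = (map, key, idx) => {
    if (!map.has(key)) map.set(key, []);
    map.get(key).push(idx);
  };
  const stack = [[root, null]];
  while (stack.length > 0) {
    const [node, parent] = stack.pop();
    add(locations, node.value, node.index);
    if (parent !== null) add(parentLocations, node.value, parent.index);
    for (const child of node.children) stack.push([child, node]);
  }
  // Keep every list sorted and free of duplicates so lists can be merged linearly.
  for (const map of [locations, parentLocations]) {
    for (const [key, list] of map) {
      map.set(key, [...new Set(list)].sort((a, b) => a - b));
    }
  }
  return { locations, parentLocations };
}

// Intersection of two ascending lists in a single linear pass.
function intersectSorted(a, b) {
  const out = [];
  let i = 0;
  let j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) { out.push(a[i]); i++; j++; }
    else if (a[i] < b[j]) i++;
    else j++;
  }
  return out;
}

// Indexes of nodes valued `head` that have a child for every value in childValues.
function findPattern(maps, head, childValues) {
  let result = maps.locations.get(head) || [];
  for (const v of childValues) {
    result = intersectSorted(result, maps.parentLocations.get(v) || []);
  }
  return result;
}

// The tree from the question, using its 1-based indexes.
const n = (index, value, children = []) => ({ index, value, children });
const tree = n(1, "a", [
  n(2, "b", [
    n(4, "c", [
      n(5, "e", [n(8, "c", [n(9, "f"), n(10, "g")])]),
      n(6, "f"),
      n(7, "g"),
    ]),
  ]),
  n(3, "d"),
]);
const maps = buildMaps(tree);
console.log(findPattern(maps, "c", ["f", "g"])); // [ 4, 8 ]
```

The maps are built once in a single traversal, so repeated searches only pay for the list intersections.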

Related

how should I find the shortest path in a single direction tree

I have a graph that is not binary, and is of single direction. It can look like the following:
I would like to find the shortest path, for example Team "G" to Team "D". What methods can I use?
By "of single direction" I imagine what you mean is that the tree is represented by nodes which only point at the children, not at the parents. Given that, user3386109's comment provides a simple way to get the answer you are looking for.
1. Find the path from the root to the first node by doing any tree traversal, like in-order (even if the order of children is insignificant, in practice there will be some way to enumerate them in some order), and record the sequence of nodes from the root to the first node. In your example, we would get G-B-A (assuming a recursive solution where we are printing these nodes in reverse order).
2. Find the path from the root to the second node in the same way. Here, we'd get a path like D-A.
3. Find the first common ancestor of the two nodes; assuming the nodes are labeled uniquely as in this example, we can simply find the first symbol in either string of nodes that is also in the other string of nodes. Regardless of which string we start with, we should get the same answer, as we do in your example: A.
4. Chop off everything in both strings after the common ancestor, reverse one of the strings, and concatenate them with the common ancestor in the middle. This gives a path starting at node1 and going to node2. Note that the problem asks for the shortest path; however, in a tree, there is exactly one path between any two nodes. In your example, we get G-B-A-D.
Some pseudocode...
class Node {
    public char label;
    public Node[] children;
}

string FindPath(Node root, Node node1, Node node2) {
    // make sure we have valid inputs
    if (root == null || node1 == null || node2 == null) return null;
    // look for paths to the nodes from the root
    // (each path lists the target node first and the root last)
    string path1 = FindPath(root, node1);
    string path2 = FindPath(root, node2);
    // one of the nodes wasn't found
    if (path1 == null || path2 == null) return null;
    // look for first common node
    // note: this isn't the most efficient approach, see
    // comments on time complexity below
    for (int i = 0; i < path1.Length; i++) {
        char label = path1[i];
        if (path2.Contains(label)) {
            path1 = path1.Substring(0, i);
            path2 = path2.Substring(0, path2.IndexOf(label));
            return path1 + label + ReverseString(path2);
        }
    }
    // will never reach here because it's guaranteed we will find
    // a common ancestor in a tree
    throw new Exception("Unreachable statement");
}

string FindPath(Node root, Node node) {
    // make sure inputs are valid
    if (root == null || node == null) return null;
    if (root.label == node.label) {
        // found the node
        return node.label.ToString();
    } else {
        // this is not the node, exit early if no children
        if (root.children == null || root.children.Length == 0) return null;
        // check each child and if we find a path, return it
        foreach (Node child in root.children) {
            string path = FindPath(child, node);
            if (path != null && path.Length > 0) {
                return path + root.label;
            }
        }
        // it's possible that the target node is not in this subtree
        return null;
    }
}

// helper used above: returns the characters of s in reverse order
string ReverseString(string s) {
    char[] chars = s.ToCharArray();
    Array.Reverse(chars);
    return new string(chars);
}
In terms of the number of nodes in the tree, the complexity should look like...
Getting the path from the root to each node should at worst visit each node in the tree, so O(|V|)
Looking for the first common node... well, a naive approach would be O(|V|^2) if we use the code as proposed above, since in the worst case, the tree is split in two long branches. But, we could be a little smarter and start at the end and work our way back as long as the strings match, and return the last matching node we saw as soon as we see that they don't match anymore (or we run out of symbols to check). Coding that is left as an exercise and that should be O(|V|)
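The linear-time variant described above could look something like the sketch below. It assumes each path is an array of labels listed from the root down to the target (the reverse of the strings used in the pseudocode), so comparing from the root end becomes a simple walk from index 0. The function name `joinPaths` is made up for this illustration.

```javascript
// Assumed input: two arrays of labels, each listed from the root down to a node,
// e.g. ["A", "B", "G"] and ["A", "D"].
function joinPaths(rootToFirst, rootToSecond) {
  // Walk both paths from the root while they agree; the last position
  // where they agree is the lowest common ancestor.
  let common = 0;
  while (
    common < rootToFirst.length &&
    common < rootToSecond.length &&
    rootToFirst[common] === rootToSecond[common]
  ) {
    common++;
  }
  if (common === 0) return null; // the paths do not start at the same root

  // From the first node up to (and including) the common ancestor...
  const up = rootToFirst.slice(common - 1).reverse();
  // ...then from just below the common ancestor down to the second node.
  const down = rootToSecond.slice(common);
  return up.concat(down);
}

console.log(joinPaths(["A", "B", "G"], ["A", "D"]).join("-")); // G-B-A-D
```

Each path is scanned at most once, so this step is linear in the combined path length rather than quadratic.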

Conversion of a tree to a new tree with p as its new root

The actual question is :
Write a program that takes as input a general tree T and a position p of T and converts T to another tree with the same set of position adjacencies, but now with p as its root.
I am not sure what does it means by the position adjacencies exactly but from what I understood is that the relation between the parent as the nodes should be maintained.
But if a node is made the root, then the nodes won't have the same positional adjacency. I would like to implement this question using a binary tree for starters.
Can someone help me with how I should implement it?
The quote uses some terms which should be further defined, but I will try to make some assumptions here:
...but now with p as its root.
This implies that the tree given as input is a so-called rooted tree and not just a general tree as it states at the outset. It is important to note this, as in graph theory a tree is just a graph where each pair of nodes is connected by a single path. The concept of root only enters the story when we speak of a rooted tree.
...a position p of T...
This is uncommon terminology. I'll assume that position is a synonym for what is generally called a vertex or node in graph theory.
... position adjacencies
I'll assume that this is referring to the edges of the tree.
... the relation between the parent as the nodes should be maintained.
If you phrase it that way, then it is impossible to change the tree, as the set of parent-child relationships uniquely defines the rooted tree. So we must assume that this is not about maintaining the parent-child relationship, but just the relationship between two vertices, where possibly the role of parent may be changing.
I would like to implement this question using a binary tree for starters.
That is not a good idea: a binary tree may need to transition to a non-binary tree. For example, let's say the input tree is this binary tree:
    4
   /
  2
 / \
1   3
And the input vertex is the one with value 2. That means the output tree will not be binary, but will be:
  2
 /|\
1 3 4
So, we have a rooted tree and a vertex in that tree as input, and need to produce a rooted tree as output whose root is that given vertex.
Algorithm
The algorithm can be recursive:
1. If the given node p (the one that should become the root) has no parent, nothing needs to change: exit.
2. Recursively solve the problem of making p's parent the root. Once that is done, p's parent has no parent, since it has become the root.
3. Detach p from its parent, and instead make that parent a child of p. These two nodes remain connected, but the parent-child roles are swapped.
Here is an implementation in JavaScript, demonstrated on this example tree:
      8
     / \
    4   10
   /
  2   <-- to become new root
 / \
1   3
Resulting tree should be:
  2
 /|\
1 3 4
     \
      8
       \
        10
class Node {
    constructor(value) {
        this.value = value;
        this.children = new Set; // any number of children
        this.parent = null;
    }
    addChild(node) {
        this.children.add(node);
        node.parent = this; // back reference
        return node;
    }
    detachFromParent() {
        if (this.parent != null) {
            this.parent.children.delete(this);
            this.parent = null;
        }
    }
    makeRoot() {
        let parent = this.parent;
        if (parent != null) {
            parent.makeRoot();
            this.detachFromParent();
            this.addChild(parent);
        }
    }
    print(indent="") {
        console.log(indent + this.value);
        indent += " ";
        for (const child of this.children) {
            child.print(indent);
        }
    }
}

// demo
let eight = new Node(8);
let ten = new Node(10);
let four = new Node(4);
let two = new Node(2);
let one = new Node(1);
let three = new Node(3);
eight.addChild(four);
eight.addChild(ten);
four.addChild(two);
two.addChild(one);
two.addChild(three);
console.log("input:");
eight.print();
console.log("change root to 2...");
two.makeRoot();
two.print();

Why AddAfter() has constant time?

In linked list operations, AddBefore(node,key) takes linear time, i.e., O(n), but AddAfter(node,key) takes constant time, i.e., O(1). Can anyone explain why?
Picture how a singly-linked list is organized:
A->B->C->D
Now, imagine you want to add a node after B. You can directly access the node and access its next pointer to link in a new node. So if you create a new node, call it X, with the passed key, you can do this:
Copy B's next pointer to X // B and X both point to C
Set B's next pointer to X // B points to X and X points to C
AddAfter(node,key)
{
    newNode = CreateNewNode(key);
    newNode.next = node.next;
    node.next = newNode;
}
But if you want to add before, you don't know which node comes before B. So you have to scan the list to find out:
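AddBefore(node, key)
{
    // special case: nothing precedes the head, so the new node becomes the head
    if (node == head)
    {
        newNode = CreateNewNode(key);
        newNode.next = head;
        head = newNode;
        return;
    }
    parent = head;
    // find the node that points to the passed node
    while (parent.next != node)
    {
        parent = parent.next;
    }
    // Then add the new node after parent
    AddAfter(parent, key);
}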
That's not necessary with a doubly-linked list, because each node has a pointer to its predecessor as well as to its successor.
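To illustrate the doubly-linked case, here is a minimal sketch in JavaScript. The class and field names (`DNode`, `DoublyLinkedList`, `prev`, `next`) are assumptions made for this example, not part of any particular library. Because each node stores its predecessor, `addBefore` needs no scan and runs in constant time.

```javascript
// Hypothetical node type with links in both directions.
class DNode {
  constructor(key) {
    this.key = key;
    this.prev = null;
    this.next = null;
  }
}

class DoublyLinkedList {
  constructor() {
    this.head = null;
  }
  addAfter(node, key) {
    const fresh = new DNode(key);
    fresh.prev = node;
    fresh.next = node.next;
    if (node.next !== null) node.next.prev = fresh;
    node.next = fresh;
    return fresh;
  }
  addBefore(node, key) {
    const fresh = new DNode(key);
    fresh.next = node;
    fresh.prev = node.prev; // the predecessor is known without scanning
    if (node.prev !== null) node.prev.next = fresh;
    else this.head = fresh; // inserting in front of the head
    node.prev = fresh;
    return fresh;
  }
  toArray() {
    const keys = [];
    for (let cur = this.head; cur !== null; cur = cur.next) keys.push(cur.key);
    return keys;
  }
}

const list = new DoublyLinkedList();
list.head = new DNode("A");
const b = list.addAfter(list.head, "B");
list.addAfter(b, "C");
list.addBefore(b, "X");
console.log(list.toArray().join("->")); // A->X->B->C
```

The price for the O(1) `addBefore` is one extra pointer per node and two extra pointer updates per insertion.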
Jim

Efficiently convert array to cartesian tree

I know how to convert an array to a cartesian tree in O(n) time
http://en.wikipedia.org/wiki/Cartesian_tree#Efficient_construction and
http://community.topcoder.com/tc?module=Static&d1=tutorials&d2=lowestCommonAncestor#From RMQ to LCA
However, the amount of memory required is too high (constants) since I need to associate a left and right pointer at least with every node in the cartesian tree.
Can anyone link me to work done to reduce these constants (hopefully to 1)?
You do not need to keep the right and left pointers associated with your cartesian tree nodes.
You just need to keep the parent of each node. Recall the definition of a Cartesian tree:
(A Cartesian Tree of an array A[0, N - 1] is a binary tree C(A) whose root is a minimum element of A, labeled with the position i of this minimum. The left child of the root is the Cartesian Tree of A[0, i - 1] if i > 0, otherwise there's no child. The right child is defined similarly for A[i + 1, N - 1].)
From this definition, you can simply traverse the array: if the parent of a node has a lower index than the node itself, then the node is the right child of its parent; similarly, if the parent has a higher index, then the node is the left child of its parent.
Hope this helps.
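As a small illustration of that rule, the sketch below decides the side of a child purely from a parent array. The array name `parentOf`, the -1 marker for the root, and the helper names `childSide` and `childrenOf` are assumptions made for this example.

```javascript
// parentOf[i] is the index of node i's parent, or -1 for the root (assumed convention).
function childSide(parentOf, i) {
  const p = parentOf[i];
  if (p < 0) return "root";
  // A parent with a lower index lies to the left in the array,
  // so the node must sit in that parent's right subtree.
  return p < i ? "right" : "left";
}

// Recover the children of node p on demand by scanning the parent array.
function childrenOf(parentOf, p) {
  const result = { left: -1, right: -1 };
  for (let i = 0; i < parentOf.length; i++) {
    if (parentOf[i] === p) result[childSide(parentOf, i)] = i;
  }
  return result;
}

// For the array [3, 1, 2] the minimum 1 (index 1) is the root;
// 3 (index 0) is its left child and 2 (index 2) is its right child.
const parentOf = [1, -1, 1];
console.log(childSide(parentOf, 0)); // left
console.log(childSide(parentOf, 2)); // right
console.log(childrenOf(parentOf, 1)); // { left: 0, right: 2 }
```

The trade-off is that finding the children of a node this way costs a scan of the array, so it suits workloads that mostly walk upwards.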
It is possible to construct a Cartesian tree with only extra space for child-to-parent references (by index): so besides the input array, you would need an array of equal size, holding index values that relate to the first array. If we call that extra array parentOf, then array[parentOf[i]] will be the parent of array[i], except when array[i] is the root. In that case parentOf[i] should be like a NIL pointer (or, for example, -1).
The Wikipedia article on Cartesian trees gives a simple construction method:
One method is to simply process the sequence values in left-to-right order [...] in a structure that allows both upwards and downwards traversal of the tree
This may give the impression that it is necessary for that algorithm to maintain both upwards and downwards links in the tree, but this is not the case. It can be done with only maintaining links from child to parent.
During the construction, a new value is injected into the path that ends in the rightmost node (having the value that was most recently added). Any child in that path is by necessity a right child of its parent.
While walking up that path in the opposite direction, from the leaf, keep track of a parent and its right child (where you came from). Once you find the insertion point, that child will get the new node as parent, and the new child will get the "old" parent as its parent.
At no instance in this process do you need to store pointers to children.
Here is the algorithm written in JavaScript. As example, the tree is populated from the input array [9,3,7,1,8,12,10,20,15,18,5]. For verification only, both the input array and the parent references are printed:
class CartesianTree {
    constructor() {
        this.values = [];
        this.parentOf = [];
    }
    extend(values) {
        for (let value of values) this.push(value);
    }
    push(value) {
        let added = this.values.length; // index of the new value
        let parent = added - 1; // index of the most recently added value
        let child = -1; // a NIL pointer
        this.values.push(value);
        while (parent >= 0 && this.values[parent] > value) {
            child = parent;
            parent = this.parentOf[parent]; // move up
        }
        // inject the new node between child and parent
        this.parentOf[added] = parent;
        if (child >= 0) this.parentOf[child] = added;
    }
}

let tree = new CartesianTree;
tree.extend([9,3,7,1,8,12,10,20,15,18,5]);
printArray("indexes:", tree.values.keys());
printArray(" values:", tree.values);
printArray("parents:", tree.parentOf);

function printArray(label, arr) {
    console.log(label, Array.from(arr, value => (""+value).padStart(3)).join(" "));
}
You can use a heap layout to store your tree: essentially it is an array where the first element in the array is the root, the second is the left child of the root, the third is the right child, and so on. It is much cheaper but requires a little more care when programming it.
http://en.wikipedia.org/wiki/Binary_heap
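For reference, the index arithmetic behind that array layout can be sketched as below (0-based indexes; the helper names are made up for this example). One assumption worth stating loudly: the layout reserves a slot for every position of a complete binary tree, so it is compact only when the tree is close to complete. A Cartesian tree can be very unbalanced, in which case most slots would stay empty.

```javascript
// Implicit (pointer-free) binary tree layout with 0-based indexes.
const leftChild = (i) => 2 * i + 1;
const rightChild = (i) => 2 * i + 2;
const parentIndex = (i) => (i === 0 ? -1 : Math.floor((i - 1) / 2));

// A complete binary tree stored level by level:
//        1
//      /   \
//     3     5
//    / \
//   9   7
const slots = [1, 3, 5, 9, 7];
console.log(slots[leftChild(0)], slots[rightChild(0)]); // 3 5
console.log(slots[parentIndex(4)]); // 3
```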

How to find the rank of a node in an AVL tree?

I need to implement two rank queries [rank(k) and select(r)]. But before I can start on this, I need to figure out how the two functions work.
As far as I know, rank(k) returns the rank of a given key k, and select(r) returns the key of a given rank r.
So my questions are:
1.) How do you calculate the rank of a node in an AVL tree (a self-balancing BST)?
2.) Is it possible for more than one key to have the same rank? And if so, what would select(r) return?
I'm going to include a sample AVL tree which you can refer to if it helps answer the question.
Thanks!
Your question really boils down to: "how is the term 'rank' normally defined with respect to an AVL tree?" (and, possibly, how is 'select' normally defined as well).
At least as I've seen the term used, "rank" means the position among the nodes in the tree -- i.e., how many nodes are to its left. You're typically given a pointer to a node (or perhaps a key value) and you need to count the number of nodes to its left.
"Select" is basically the opposite -- you're given a particular rank, and need to retrieve a pointer to the specified node (or the key for that node).
Two notes: First, since neither of these modifies the tree at all, it makes no real difference what form of balancing is used (e.g., AVL vs. red/black); for that matter a tree with no balancing at all is equivalent as well. Second, if you need to do this frequently, you can improve speed considerably by adding an extra field to each node recording how many nodes are to its left.
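The second note can be sketched as follows. This is a variant under stated assumptions rather than a definitive implementation: instead of storing "nodes to the left" directly, each node keeps a `size` field (the number of nodes in its subtree), from which the left count is `size(node.left)`. The tree here is a plain unbalanced BST; in a real AVL tree the rotations would also have to update `size`. Ranks are 0-based, meaning the number of smaller keys.

```javascript
// Assumed node shape (hypothetical): { key, left, right, size }.
const size = (node) => (node === null ? 0 : node.size);

// Plain BST insert without rebalancing; keeps size up to date on the way back up.
function insert(node, key) {
  if (node === null) return { key, left: null, right: null, size: 1 };
  if (key < node.key) node.left = insert(node.left, key);
  else if (key > node.key) node.right = insert(node.right, key);
  node.size = 1 + size(node.left) + size(node.right);
  return node;
}

// Number of keys smaller than `key`, or -1 if the key is absent.
function rank(node, key) {
  let count = 0;
  while (node !== null) {
    if (key < node.key) {
      node = node.left;
    } else if (key > node.key) {
      count += size(node.left) + 1; // skip the left subtree and this node
      node = node.right;
    } else {
      return count + size(node.left);
    }
  }
  return -1;
}

// Key that has exactly r smaller keys, or undefined if r is out of range.
function select(node, r) {
  while (node !== null) {
    const leftSize = size(node.left);
    if (r < leftSize) {
      node = node.left;
    } else if (r > leftSize) {
      r -= leftSize + 1;
      node = node.right;
    } else {
      return node.key;
    }
  }
  return undefined;
}

let root = null;
for (const k of [50, 30, 70, 20, 40, 60, 80]) root = insert(root, k);
console.log(rank(root, 60)); // 4
console.log(select(root, 4)); // 60
```

With the size field in place, both queries follow a single root-to-node path, so their cost is proportional to the height of the tree.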
Rank is the number of nodes in the Left sub tree plus one, and is calculated for every node. I believe rank is not a concept specific to AVL trees - it can be calculated for any binary tree.
Select is just opposite to rank. A rank is given and you have to return a node matching that rank.
The following code will perform rank calculation:
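/* Assumed node layout; adapt the field names to your own tree. */
struct TreeNode
{
    struct TreeNode *LChild;
    struct TreeNode *RChild;
    int rank;
};

/* Counts the nodes in the subtree rooted at Node. */
int NumberOfNodesInTree(struct TreeNode *Node)
{
    if (!Node)
    {
        return 0;
    }
    return 1 + NumberOfNodesInTree(Node->LChild) + NumberOfNodesInTree(Node->RChild);
}

/* Stores in every node the size of its left subtree plus one. */
void InitRank(struct TreeNode *Node)
{
    if (!Node)
    {
        return;
    }
    Node->rank = 1 + NumberOfNodesInTree(Node->LChild);
    InitRank(Node->LChild);
    InitRank(Node->RChild);
}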
Here is the code I wrote, which worked fine on an AVL tree, to get the rank of a particular value. The only difference is that you used a node as the parameter and I used a key. You can adapt this to your needs. Sample code:
public int rank(int data) {
    return rank(data, root);
}

private int rank(int data, AVLNode r) {
    int rank = 1;
    while (r != null) {
        if (data < r.data) {
            r = r.left;
        } else if (data > r.data) {
            rank += 1 + countNodes(r.left);
            r = r.right;
        } else {
            r.rank = rank + countNodes(r.left);
            return r.rank;
        }
    }
    return 0;
}
[N.B.] If you want ranks to start from 0, initialize the variable rank to 0 instead.
You also need to implement the countNodes() method yourself for this code to run.

Resources