Decision Tree Depth - cart

As part of my project, I have to use a decision tree. I am using MATLAB's "fitctree" function to classify features that I extracted with PCA.
I want to control the number of trees and the tree depth in the fitctree function.
Does anyone know how I can do this? For example, how would I change the number of trees to 200 and the tree depth to 10?
Is it possible to change these values in a decision tree?
Best,

fitctree offers only the following input parameters to control the depth of the resulting tree:
MaxNumSplits
MinLeafSize
MinParentSize
https://de.mathworks.com/help/stats/classification-trees-and-regression-trees.html#bsw6baj
You have to play with those parameters to control the depth of your tree. That's because, without such limits, the decision tree only stops growing when purity is reached.
Another possibility would be to turn on pruning. Pruning will reduce the size of your tree by removing sections of the tree that provide little power to classify instances.

Let me assume that you are using the ID3 algorithm. Its pseudocode can be extended with a Depth parameter to control the depth of the tree.
ID3 (Examples, Target_Attribute, Attributes, Depth)
    Create a root node for the tree
    // Check the remaining depth: if it is 0, stop splitting and return a leaf
    If Depth == 0, Return the single-node tree Root,
        with label = most common value of the target attribute in the examples.
    // Else continue
    If all examples are positive, Return the single-node tree Root, with label = +.
    If all examples are negative, Return the single-node tree Root, with label = -.
    If number of predicting attributes is empty, then Return the single node tree Root,
        with label = most common value of the target attribute in the examples.
    Otherwise Begin
        A ← The Attribute that best classifies examples.
        Decision Tree attribute for Root = A.
        For each possible value, vi, of A,
            Add a new tree branch below Root, corresponding to the test A = vi.
            Let Examples(vi) be the subset of examples that have the value vi for A
            If Examples(vi) is empty
                Then below this new branch add a leaf node with label = most common target value in the examples
            // We decrease the value of Depth by 1 so the tree stops growing when it reaches the designated depth
            Else below this new branch add the subtree ID3 (Examples(vi), Target_Attribute, Attributes – {A}, Depth - 1)
    End
    Return Root
What algorithm does your fitctree function try to implement?
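The depth check in the ID3 pseudocode can be sketched in Python as follows. This is only an illustration under assumptions that are not in the original answer: examples are dicts of categorical attributes, leaves are plain labels, internal nodes are (attribute, branches) tuples, and the helper names (entropy, id3, tree_depth) are made up for this sketch. It has nothing to do with how fitctree is implemented.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())


def id3(rows, labels, attributes, depth):
    """Depth-limited ID3. Returns a label (leaf) or (attribute, {value: subtree})."""
    majority = Counter(labels).most_common(1)[0][0]
    # Stop on a pure node, when no attributes are left, or when the depth
    # budget is exhausted; in all three cases return a majority-label leaf.
    if len(set(labels)) == 1 or not attributes or depth == 0:
        return majority

    def gain(attribute):
        remainder = 0.0
        for value in set(row[attribute] for row in rows):
            subset = [l for row, l in zip(rows, labels) if row[attribute] == value]
            remainder += len(subset) / len(labels) * entropy(subset)
        return entropy(labels) - remainder

    best = max(attributes, key=gain)
    remaining = [a for a in attributes if a != best]
    branches = {}
    # Only values that occur in the examples are branched on, so the
    # "Examples(vi) is empty" case of the pseudocode cannot arise here.
    for value in set(row[best] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        branches[value] = id3([rows[i] for i in idx], [labels[i] for i in idx],
                              remaining, depth - 1)  # one level of budget used
    return (best, branches)


def tree_depth(tree):
    """Number of split levels in a tree built by id3 (0 for a single leaf)."""
    if not isinstance(tree, tuple):
        return 0
    return 1 + max(tree_depth(sub) for sub in tree[1].values())
```

Calling id3(rows, labels, attributes, 10) would then cap the tree at ten levels of splits; a smaller tree is still possible when the data becomes pure earlier.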


Job Interview Question Using Trees, What data to save?

I was solving the following job interview question and solved most of it but failed at the last requirement.
Q: Build a data structure which supports the following functions:
Init - Initialise Empty DS. O(1) Time complexity.
SetPositiveInDay(d,x) - Add to the DS that in day d exactly x new people were infected with covid-19. O(log n) time complexity.
WorstBefore(d) - From the days inserted into the DS that are smaller than d, return the last one which has more newly infected people than d. O(log n) time complexity.
For example:
Init()
SetPositiveInDay(1,10)
SetPositiveInDay(2,20)
SetPositiveInDay(3,15)
SetPositiveInDay(5,17)
SetPositiveInDay(23,180)
SetPositiveInDay(8,13)
SetPositiveInDay(13,18)
WorstBefore(13) // Returns day #2
SetPositiveInDay(10,19)
WorstBefore(13) // Returns day #10
Important note: you can't assume that days will be entered in order, nor that there won't be "gaps" between days. (Some days may not be saved in the DS while later ones are.)
What I did?
I used AVL tree (I could use 2-3 tree too).
For each node I have:
Sick - Number of new infected people in that day.
maxLeftSick - Max number of infected people for left son.
maxRightSick - Max number of infected people for right son.
When inserting a new node I made sure that no data gets lost during rotations; plus, for each single node from the new one up to the root I did:
But I wasn't successful implementing WorstBefore(d).
Where to search?
First you need to find the node (call it node) corresponding to d in the tree ordered by days. Let x = Sick(node). This can be done in O(log n).
If maxLeftSick(node) > x, the solution must be in the left subtree of node. Search for the solution there and return the answer. This can be done in O(log n) - see below.
Otherwise, traverse the tree upwards towards the root, starting from node, until you find the first node nextPredecessor satisfying both of these properties (this takes O(log n)):
nextPredecessor is smaller than node,
and either
Sick(nextPredecessor) > x (case 1) or
maxLeftSick(nextPredecessor) > x (case 2).
If no such node exists, there is no solution. In case 1, just return nextPredecessor since that is the best solution.
In case 2, we know that the solution must be in the left subtree of nextPredecessor, so search there and return the answer. Again, this takes O(log n) - see below.
Note that there is no need to search in the right subtree of nextPredecessor: the only nodes in that subtree that are smaller than node are the left subtree of node itself and the smaller ancestors we passed on the way up (together with their left subtrees), and we have already excluded all of those.
Note also that it is not necessary to traverse further up the tree than nextPredecessor since those nodes are even smaller, and we are looking for the largest node satisfying all constraints.
How to search?
OK, so how do we search for the solution in a subtree? Finding the largest day within a subtree rooted in q that is worse than an infection number x is simple using the maxLeftSick and maxRightSick information:
If q has a right child and maxRightSick(q) > x then search in the right subtree of q.
Otherwise, if Sick(q) > x, return Day(q).
Otherwise, if q has a left child and maxLeftSick(q) > x then search in the left subtree of q.
Otherwise there is no solution within the subtree q.
We are effectively using maxLeftSick and maxRightSick to prune the search tree to include only "worse" nodes, and within that pruned tree we get the rightmost node, i.e. the one with the largest day.
It is easy to see that this algorithm runs in O(log n) where n is the total number of nodes since the number of steps is bounded by the height of the tree.
Pseudocode
Here is the pseudocode (assuming maxLeftSick and maxRightSick return -1 if no corresponding child node exists):
// Returns the largest day smaller than d such that its
// infection number is larger than the infection number on day d.
// Returns -1 if no such day exists.
int WorstBefore(int d) {
    node = find(d);
    // try to find the solution in the left subtree
    if (maxLeftSick(node) > Sick(node)) {
        return FindLastWorseThan(node -> left, Sick(node));
    }
    // move up towards root until we find the first node
    // that is smaller than `node` and such that
    // Sick(nextPredecessor) > Sick(node) or
    // maxLeftSick(nextPredecessor) > Sick(node).
    nextPredecessor = findNextPredecessor(node);
    if (nextPredecessor == null) return -1;
    // Case 1
    if (Sick(nextPredecessor) > Sick(node)) return Day(nextPredecessor);
    // Case 2: maxLeftSick(nextPredecessor) > Sick(node)
    return FindLastWorseThan(nextPredecessor -> left, Sick(node));
}
// Finds the latest day within the given subtree with root q where
// the infection number is larger than x. Runs in O(log(size(q))).
int FindLastWorseThan(Node q, int x) {
    // the right subtree holds the latest days, so try it first
    if (maxRightSick(q) > x) return FindLastWorseThan(q -> right, x);
    // nothing worse to the right, so q itself is the latest candidate
    if (Sick(q) > x) return Day(q);
    if (maxLeftSick(q) > x) return FindLastWorseThan(q -> left, x);
    return -1;
}
First of all, your chosen data structure looks fine to me. You did not mention it explicitly, but I assume that the "key" you use in the AVL tree is the day number, i.e. an in-order traversal of the tree would list the nodes in their chronological order.
I would just suggest a cosmetic change: store the maximum value of sick in the node itself, so that you don't have two similar pieces of information (maxLeftSick and maxRightSick) stored in one node instance, but move them to the child nodes, so that your node.maxLeftSick is actually stored in node.left.maxSick, and similarly node.maxRightSick is stored in node.right.maxSick. This is of course not done when that child does not exist, but then we don't need that information either. In your structure maxLeftSick would be 0 when left is not defined. In my proposed structure, you would not have that value: the 0 would follow naturally from the fact that there is no left child. In my proposal, the root node would have a value in maxSick which is not present in yours, and which would be the maximum of root.sick, your root.maxLeftSick and your root.maxRightSick. This information would not really be used, but it is just there to make the structure consistent throughout the tree.
So you would just store one maxSick, which considers the current node's sick value also in that maximum. The processing you do during rotations will need to change accordingly, but will not become more complex.
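As a rough sketch of how the single maxSick field could be kept correct across a rotation (the names Node, update and rotate_left are illustrative and not taken from the question, and the AVL height bookkeeping is left out):

```python
def update(node):
    """Recompute node.max_sick from the node's own value and its children."""
    node.max_sick = max(
        node.sick,
        node.left.max_sick if node.left else 0,
        node.right.max_sick if node.right else 0,
    )


class Node:
    def __init__(self, day, sick, left=None, right=None):
        self.day, self.sick = day, sick
        self.left, self.right = left, right
        update(self)


def rotate_left(x):
    """Standard left rotation around x; returns the new subtree root."""
    y = x.right
    x.right, y.left = y.left, x
    update(x)  # x is now the lower node, so refresh it first
    update(y)  # then the new subtree root, which depends on x
    return y
```

Only the two nodes involved in the rotation need refreshing, lower one first; the subtrees hanging off them keep their values.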
I will assume that your AVL tree does not keep track of parent pointers. So create a find method which will return the path to the node to be found. For instance, in Python syntax, it could look like this:
def find(self, day):
    node = self.root
    path = []  # an array of nodes
    while node:
        path.append(node)
        if node.day == day:  # bingo
            return path
        if day < node.day:
            node = node.left
        else:
            node = node.right
Then the worstBefore method could look like this:
def worstBefore(self, day):
    path = self.find(day)
    if not path:
        return  # day not found
    # get number of sick people on that day:
    sick = path[-1].sick
    # look for recent day with greater number of sick
    while path:
        node = path.pop()  # walk upward, starting with found node
        if node.day > day:
            # a later ancestor: its left subtree holds the queried day
            # itself and possibly later days, so skip it
            continue
        if node.day < day and node.sick > sick:
            return node.day
        if node.left and node.left.maxSick > sick:
            # we will find the result in this subtree
            node = node.left
            while True:
                if node.right and node.right.maxSick > sick:
                    node = node.right
                elif node.sick > sick:  # bingo
                    return node.day
                else:
                    node = node.left
So the path returned by the find method will be used to get the parents of a node when you need to backtrack upwards in the tree along that path.
If along that path you find a left child whose maxSick is greater, then you know that the targeted node must be in that subtree. It is then a matter to walk down that subtree in a controlled way, choosing the right child when it still has maxSick greater. Otherwise check the current node's sick value and return that one if that value is greater. Otherwise go left, and repeat.
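While there is no such left subtree, go up along the path. If that parent is a match, then return it (make sure to verify the day number). Skip parents whose day is larger than the queried day, because their left subtree contains the queried node itself and possibly later days. For the other parents, keep checking for left subtrees that have a larger maxSick.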
This runs in O(logn) because you first will walk zero or more steps upward and then zero or more steps downward (in a left subtree).
You can see your example scenario run on repl.it. There I focussed on this question, and didn't implement the rotations.
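For readers who want something to run locally, here is a self-contained sketch of the same idea. It makes several assumptions that are not in the answer: a plain unbalanced BST stands in for the AVL tree (no rotations), every day is inserted at most once, and the names CovidTree, set_positive_in_day and worst_before are invented for this sketch. Ancestors with a later day are skipped when walking up.

```python
class Node:
    def __init__(self, day, sick):
        self.day, self.sick = day, sick
        self.max_sick = sick  # largest sick value in this subtree
        self.left = self.right = None


class CovidTree:
    """Unbalanced BST keyed by day; rotations are omitted in this sketch."""

    def __init__(self):
        self.root = None

    def set_positive_in_day(self, day, sick):
        new = Node(day, sick)
        if self.root is None:
            self.root = new
            return
        node = self.root
        while True:
            # the new node ends up below this one, so refresh the maximum
            node.max_sick = max(node.max_sick, sick)
            side = "left" if day < node.day else "right"
            child = getattr(node, side)
            if child is None:
                setattr(node, side, new)
                return
            node = child

    def find(self, day):
        """Return the root-to-node path for the given day, or None."""
        path, node = [], self.root
        while node:
            path.append(node)
            if node.day == day:
                return path
            node = node.left if day < node.day else node.right
        return None

    def worst_before(self, day):
        path = self.find(day)
        if not path:
            return None  # day not stored
        sick = path[-1].sick
        while path:
            node = path.pop()  # walk upward, starting at the found node
            if node.day > day:
                continue  # later ancestor: its left subtree is not all earlier
            if node.day < day and node.sick > sick:
                return node.day
            if node.left and node.left.max_sick > sick:
                node = node.left  # the answer is the latest worse day in here
                while True:
                    if node.right and node.right.max_sick > sick:
                        node = node.right
                    elif node.sick > sick:
                        return node.day
                    else:
                        node = node.left
        return None
```

Replaying the scenario from the question against this sketch gives day 2 for the first query and day 10 after day 10 has been inserted with 19 cases.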

Non-boundary nodes in a binary tree

I have a binary tree and I want to print all non-boundary nodes.
Boundary nodes: all leaf nodes + all nodes on the path from the root to the leftmost node + all nodes on the path from the root to the rightmost node.
I have done this using an extra boolean in the tree structure to mark whether a node is a boundary node, and then doing a traversal and printing the nodes that are not. Can someone come up with a better approach? Mine uses some extra space (though very little).
One observation you might find helpful is that a non-boundary node in a binary tree is one that (a) isn't a leaf and (b) is one where along the access path to the node, you've taken a step left and a step right. Therefore, one option would be to do a normal tree traversal, tracking whether you've gone left and gone right along the way. Here's some pseudocode:
function printNonBoundaryNodesRec(root, goneLeft, goneRight) {
    if (root == null or root is a leaf) return;
    if (goneLeft and goneRight) print root.value;
    printNonBoundaryNodesRec(root.left, true, goneRight);
    printNonBoundaryNodesRec(root.right, goneLeft, true);
}
function printNonBoundaryNodes(root) {
    printNonBoundaryNodesRec(root, false, false);
}
Hope this helps!
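The pseudocode above translates almost line for line into Python. The sketch below assumes a minimal Node class with value, left and right fields (not given in the question) and collects the values in a list instead of printing them:

```python
class Node:
    def __init__(self, value, left=None, right=None):
        self.value, self.left, self.right = value, left, right


def non_boundary_nodes(root):
    """Return the values of all non-boundary nodes in pre-order."""
    result = []

    def walk(node, gone_left, gone_right):
        # empty subtrees and leaves never contribute
        if node is None or (node.left is None and node.right is None):
            return
        # a node reached by at least one left and one right step lies on
        # neither the leftmost nor the rightmost path
        if gone_left and gone_right:
            result.append(node.value)
        walk(node.left, True, gone_right)
        walk(node.right, gone_left, True)

    walk(root, False, False)
    return result
```

No extra field is stored in the nodes; the two flags live only on the recursion stack.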

How do I transfer a normal binary tree into a "smarter" binary tree where each node knows its parents, total subnodes and level?

I'm still getting used to data structures, and I'm comfortable with traversing binary trees in the various ways, but I'm presented now with a situation where I have a normal binary tree, constructed of nodes that only have data, left and right attributes.
However I want to transfer it into a "smarter" binary tree. This tree is to know its parent node, its total subnodes, and the level in the total tree it is at.
I'm really struggling with how I'd go about transferring the one "dumber" tree into the smarter version. My first instinct is to traverse recursively, but I'm not sure how I'd then be able to distinguish the parent and the level.
Copy the old tree to a new tree, using the normal recursive methods to traverse the original.
Since you're adding new attributes to the nodes, I presume you'll need to construct new nodes with fields for the new attributes.
Define a recursive function to copy the (sub)tree rooted at a given node. It needs as input its depth and parent. (The parent, of course, needs to be what will be its parent in the new tree.) Let it return the root of the new (sub)tree.
function copy_node (old_node, new_parent, depth) -> returns new_node {
    if old_node is nil, return nil   // empty subtree: nothing to copy
    new_node = new node
    new_node.data = old_node.data    // whatever that data might be
    new_node.depth = depth
    new_node.parent = new_parent
    new_node.left = copy_node (old_node.left, new_node, depth + 1)
    new_node.right = copy_node (old_node.right, new_node, depth + 1)
    return new_node }
Copy the whole tree with
new_tree = copy_node (old_tree, nil, 0)
If you're using a language where fields can be added to existing objects willy-nilly, you don't even have to do the extra copying:
function adorn_node (node, parent, depth) {
    if node is nil, return   // empty subtree: nothing to adorn
    node.parent = parent
    node.depth = depth
    adorn_node (node.left, node, depth + 1)
    adorn_node (node.right, node, depth + 1) }
and start the ball rolling with
adorn_node (root, nil, 0)
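The same walk can also fill in the "total subnodes" value that the question asks for, because a subtree's node count is known once both recursive calls have returned. A minimal Python sketch, assuming a bare Node class and a hypothetical size attribute that counts the node itself plus all of its descendants:

```python
class Node:
    def __init__(self, data, left=None, right=None):
        self.data, self.left, self.right = data, left, right


def adorn(node, parent=None, depth=0):
    """Attach parent, depth and size to every node; return the subtree's node count."""
    if node is None:
        return 0  # an empty subtree contributes no nodes
    node.parent = parent
    node.depth = depth
    # parent and depth flow down the recursion, size flows back up
    node.size = (1
                 + adorn(node.left, node, depth + 1)
                 + adorn(node.right, node, depth + 1))
    return node.size
```

If "total subnodes" is meant to exclude the node itself, the count of descendants is simply node.size - 1.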
That having been said, you will probably discover that there is a very good reason why most binary tree implementations do not contain these extra fields. It's a lot of work to maintain them across the many different operations you want to perform on trees. depth, especially, is hard to keep correct when you need to re-balance a tree.
And the fields don't generally buy you anything. Most algorithms that operate on trees do so using recursive functions, and as you can see from the above examples it's really easy to re-calculate both parent and depth on the fly while you're walking the tree. They don't need to be stored in the nodes themselves.
Tree-balancing often needs to know the difference in heights of the left and right subtrees. ("depth" is the distance to the root; "height" is the distance to the most distant leaf node in the subtree.) height is not so easy to calculate on the way down from the root, but fortunately you're usually only interested in which of the subtrees has the greatest height, and for that it's usually sufficient to store only the values -1, 0, +1 in each node.
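To make the depth/height distinction concrete, here is a small Python sketch that computes heights bottom-up and derives the -1, 0, +1 style balance value from them. The Node class and the function names are assumptions for illustration; a real balanced tree would maintain this value incrementally instead of recomputing heights on every call.

```python
class Node:
    def __init__(self, data, left=None, right=None):
        self.data, self.left, self.right = data, left, right


def height(node):
    """Distance to the most distant leaf; -1 for an empty subtree."""
    if node is None:
        return -1
    return 1 + max(height(node.left), height(node.right))


def balance_factor(node):
    """Right height minus left height; in an AVL tree this stays within -1..+1."""
    return height(node.right) - height(node.left)
```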

How to find the set of trees every one of which spans over another given tree?

Suppose we are given a set of trees ST in which each vertex of every tree is labeled, and another tree T (also with labeled vertices). How can I find which trees of ST can span over the tree T, starting from the root of T, in such a way that the labels of the vertices of the spanning tree T' coincide with the labels of T's vertices? Note that the children of every vertex of T should be either completely covered or not covered at all; partial covering of children is not allowed. Stated in other words: take a tree and the following procedure: pick a vertex and remove all vertices and edges below this vertex (but keep the vertex itself). Find those trees of ST that can be generated by a series of such procedures applied to T.
For example, given the tree T
[figure: tree T]
the trees
[figure: covering trees]
cover T, and the tree
[figure: non-covering tree]
does not, because this tree has children 3, 5 unlike T, which has 2, 3 as children. The best thing I was able to think of was either to brute force it, or to find the set of trees that have the same root label as T and then to search for the answer among those trees, but I guess neither of those two approaches is the optimal one. I was thinking of somehow hashing the trees but nothing came out. Any thoughts?
Notes:
The trees are not necessarily binary
A tree T can cover another tree T' if they share a root
The tree is ordered meaning that you cannot swap the position of any two children.
TL;DR: Find an efficient algorithm which, when queried with a given tree T, finds all trees from a given (fixed/static) set ST which are able to cover T.
I'll sketch an answer and then provide some working source code.
First off, you need an algorithm to hash a tree. We can assume, without loss of generality, that the children of each of your tree's nodes are ordered from least to greatest (or vice versa).
Run this algorithm on every member of ST and save the hashes.
Now, take your test tree T and generate all of its subtrees TP that retain the original root. You can do this (perhaps inefficiently) by:
Making a set S of its nodes
Generating the power set P of S
Generating the subtrees by removing the nodes present in each member of P from copies of T
Adding those subtrees which retain the original root to TP.
Now generate a set of all of the hashes of TP.
Now check each of your ST hashes for membership in the set of TP hashes.
ST hash storage requires O(n) space in ST, and possibly the space to hold the trees.
You can optimize the membership code so that it requires no storage space (I have not done this in my test code). The code will require approximately 2**N checks, where N is the number of nodes in T.
So the algorithm runs in O(H 2**N), where H is the size of ST and N is the number of nodes in T. The best way of speeding this up is to find an improved algorithm for generating the subtrees of T.
The following Python code accomplishes this:
#!/usr/bin/python
import itertools
import treelib
import Crypto.Hash.SHA
import copy

#Generate a hash of a tree by recursively hashing children
def HashTree(tree):
    digester=Crypto.Hash.SHA.new()
    digester.update(str(tree.get_node(tree.root).tag))
    children=tree.get_node(tree.root).fpointer
    children.sort(key=lambda x: tree.get_node(x).tag, cmp=lambda x,y:x-y)
    hash=False
    if children:
        for child in children:
            digester.update(HashTree(tree.subtree(child)))
        hash = "1"+digester.hexdigest()
    else:
        hash = "0"+digester.hexdigest()
    return hash

#Generate a power set of a set
def powerset(iterable):
    "powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
    s = list(iterable)
    return itertools.chain.from_iterable(itertools.combinations(s, r) for r in range(len(s)+1))

#Generate all the subsets of a tree which still share the original root
#by using a power set of all the tree's nodes to remove nodes from the tree
def TreePowerSet(tree):
    nodes=[x.identifier for x in tree.nodes.values()]
    ret=[]
    for s in powerset(nodes):
        culled_tree=copy.deepcopy(tree)
        for n in s:
            try:
                culled_tree.remove_node(n)
            except:
                pass
        if len([x.identifier for x in culled_tree.nodes.values()])>0:
            ret.append(culled_tree)
    return ret

def main():
    ST=[]

    #Generate a member of ST
    treeA = treelib.Tree()
    treeA.create_node(1,1)
    treeA.create_node(2,2,parent=1)
    treeA.create_node(3,3,parent=1)
    ST.append(treeA)

    #Generate a member of ST
    treeB = treelib.Tree()
    treeB.create_node(1,1)
    treeB.create_node(2,2,parent=1)
    treeB.create_node(3,3,parent=1)
    treeB.create_node(4,4,parent=2)
    treeB.create_node(5,5,parent=2)
    ST.append(treeB)

    #Generate hashes for members of ST
    hashes=[(HashTree(tree), tree) for tree in ST]
    print hashes

    #Generate a test tree
    T=treelib.Tree()
    T.create_node(1,1)
    T.create_node(2,2,parent=1)
    T.create_node(3,3,parent=1)
    T.create_node(4,4,parent=2)
    T.create_node(5,5,parent=2)
    T.create_node(6,6,parent=3)
    T.create_node(7,7,parent=3)

    #Generate all the subtrees of this tree which still retain the original root
    Tsets=TreePowerSet(T)

    #Hash all of the subtrees
    Thashes=set([HashTree(x) for x in Tsets])

    #For each member of ST, check to see if that member is present in the test
    #tree
    for hash in hashes:
        if hash[0] in Thashes:
            print [x for x in hash[1].expand_tree()]

main()
To verify that one tree covers another, one must look at all vertices of the first tree at least once. It is trivial to verify that a tree covers another by looking at all vertices of the first tree exactly once. Thus the simplest possible algorithm is already optimal, if it's only needed to check one tree.
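A minimal Python sketch of that single-pass check, under the assumption that an ordered tree is written as a (label, children) tuple with children given as a list; the representation and the function name covers are chosen for illustration only:

```python
def covers(s, t):
    """True if ordered tree s can be obtained from t by cutting off everything
    below some of t's vertices, so children are kept all-or-nothing."""
    s_label, s_children = s
    t_label, t_children = t
    if s_label != t_label:
        return False
    if not s_children:
        return True  # s was cut at this vertex; whatever t has below is fine
    if len(s_children) != len(t_children):
        return False  # partial covering of children is not allowed
    # children are ordered, so compare them position by position
    return all(covers(sc, tc) for sc, tc in zip(s_children, t_children))
```

Each vertex of s is visited at most once, so checking one candidate costs time proportional to its size; checking all of ST is then linear in the total size of ST.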
Everything below are untested fruits of my sick imagination.
If there are many possible T that must be checked against the same ST, then it's possible to store trees of ST as sets of facts like these
root = 1
children of node 1 = (2, 3)
children of node 2 = ()
children of node 3 = ()
These facts can be stored in a standard relational DB in two tables, "roots" (fields "tree" and "rootnode") and "branches" (fields "tree", "node" and "children"). Then an SQL query or a series of queries can be built to find matching trees quickly. My SQL-fu is rudimentary so I could not manage it in a single query, but I believe it should be possible.
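The fact-based idea can be tried out in memory before involving a database. The sketch below rests on assumptions that go beyond the answer: vertex labels are unique within each tree, trees are written as (label, children) tuples, and the names facts and covers_by_facts are invented here. A tree of ST then covers T when the roots agree and every non-empty "children of node" fact of that tree also holds in T.

```python
def facts(tree):
    """Return (root_label, {label: tuple of child labels}) for a (label, children) tree."""
    root_label, _ = tree
    table = {}
    stack = [tree]
    while stack:
        label, children = stack.pop()
        table[label] = tuple(child[0] for child in children)
        stack.extend(children)
    return root_label, table


def covers_by_facts(s_facts, t_facts):
    """Check the cover relation using only the stored facts (unique labels assumed)."""
    s_root, s_table = s_facts
    t_root, t_table = t_facts
    if s_root != t_root:
        return False
    # a vertex with no children in s was cut off there, so it imposes no
    # constraint; any other vertex must have exactly the same ordered children in t
    return all(kids == () or t_table.get(label) == kids
               for label, kids in s_table.items())
```

The facts of every tree in ST can be computed once and reused for each query tree T.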

Figuring a max repetitive sub-tree in an object tree

I am trying to solve a problem of finding a max repetitive sub-tree in an object tree.
By the object tree I mean a tree where each leaf and node has a name. Each leaf has a type and a value of that type associated with that leaf. Each node has a set of leaves / nodes in certain order.
Given an object tree that - we know - has a repetitive sub-tree in it.
By repetitive I mean 2 or more sub-trees that are similar in everything (names/types/order of sub-elements) but the values of leaves. No nodes/leaves can be shared between sub-trees.
Problem is to identify these sub-trees of the max height.
I know that the exhaustive search can do the trick. I am rather looking for more efficient approach.
You could implement a DFS traversal that generates a hash value for each node. Store these values together with the node height in a simple array. Sub-tree candidates are the duplicate values; just check that the candidates really match, since two different sub-trees could yield the same hash value.
Assuming the leafs and internal nodes are all of type Node and that standard access and traversal functions are available :
procedure dfs_update( node : Node, hashmap : Hashmap )
begin
    if is_leaf(node) then
        hashstring = concat("LEAF",'|',get_name_str(node),'|',get_type_str(node))
    else // node is an internal node
        hashstring = concat("NODE",'|',get_name_str(node))
        for each child in get_children_sorted(node)
            dfs_update(child,hashmap)
            hashstring = concat(hashstring,'|',get_hash_string(hashmap,child))
        end for
    end if
    // only a ref to node is added to the hashmap, we could also add
    // the node's height, hashstring, whatever could be useful and inappropriate
    // to keep in the Node ds
    add(hashmap, hash(hashstring),node)
end
The tricky part comes after dfs_update: we have to get the list of colliding nodes in the hashmap by descending height and check, two by two, that they are really repetitive.
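A compact Python sketch of the same idea follows. It deliberately swaps the hash strings for exact nested tuples as signatures, so no collision check is needed, at the price of larger keys. The Node dataclass and the function name tallest_repeats are assumptions for illustration; leaf values are ignored and the order of children is respected, as the question requires.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    type: str = ""        # only meaningful for leaves
    value: object = None  # ignored when comparing sub-trees
    children: list = field(default_factory=list)


def tallest_repeats(root):
    """Return (height, groups): the largest height at which a structure occurs
    more than once, and the groups of nodes sharing such a structure."""
    groups = defaultdict(list)  # signature -> list of (height, node)

    def visit(node):
        if not node.children:
            sig, height = ("LEAF", node.name, node.type), 0
        else:
            visited = [visit(child) for child in node.children]
            sig = ("NODE", node.name, tuple(s for s, _ in visited))
            height = 1 + max(h for _, h in visited)
        groups[sig].append((height, node))
        return sig, height

    visit(root)
    repeated = [g for g in groups.values() if len(g) > 1]
    if not repeated:
        return 0, []
    best = max(g[0][0] for g in repeated)
    return best, [[n for _, n in g] for g in repeated if g[0][0] == best]
```

Two nodes with the same signature have the same height, so neither can be an ancestor of the other and the sub-trees found this way never share nodes.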
