Where to find balanced tree implementation? - data-structures

Does anyone know of somewhere on the Internet where I can copy and paste (for learning purposes) an implementation of a binary search tree that supports both balancing and a query for how many nodes are less than a particular value? I want to be able to ask the data structure "How many nodes are < x?" Answering that query is the whole purpose, but balancing is important too, because I need to handle long insertion sequences that would otherwise leave the tree badly unbalanced.
I prefer implementations in Python, C, C++, or Java, but others are welcome.

If it's still relevant, you could maintain in each node the size of the subtree, like in
http://www.codeproject.com/script/Articles/ViewDownloads.aspx?aid=12347&zep=HeightBalancedTree.h&rzp=%2fKB%2farchitecture%2favl_cpp%2f%2favl_cpp.zip
Notice the FindComparable function. If the object is found, it returns its index (the number of nodes that are smaller than the searched object).
In case the searched object is not in the tree, you want to get the index of the minimal node that is larger than the searched object.
Notice what happens to the index when the object is not found.
The last Node::GetSize(Node::GetRight(node)) or Node::GetSize(Node::GetLeft(node)) will evaluate to 0, so there are two cases:
If the last turn was right (cmp > 0) you will get the index of the maximal node that is smaller than the searched object plus one - which is exactly the value that you want.
If the last turn was left (cmp < 0) you will get the index of the minimal node that is larger than the searched object minus one.
Therefore, instead of returning Size() when the object was not found, you could return:
index + ((cmp < 0) ? 1 : 0);
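For learning purposes, here is a minimal sketch in Python of the order-statistics idea described above: each node carries the size of its subtree, and the "how many nodes are < x" query walks a single root-to-leaf path. The names are illustrative and balancing is omitted; an AVL or red-black version would additionally rotate on insert and recompute size for the two nodes involved in each rotation.

class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.size = 1  # number of nodes in the subtree rooted here

def size(node):
    return node.size if node else 0

def insert(node, key):
    # Plain BST insert that keeps subtree sizes up to date.
    # (No rebalancing here; a balanced version would rotate as well.)
    if node is None:
        return Node(key)
    if key < node.key:
        node.left = insert(node.left, key)
    else:
        node.right = insert(node.right, key)
    node.size = 1 + size(node.left) + size(node.right)
    return node

def count_less_than(node, x):
    # Number of keys strictly smaller than x.
    if node is None:
        return 0
    if node.key < x:
        return 1 + size(node.left) + count_less_than(node.right, x)
    return count_less_than(node.left, x)

With rotations added, both the insert and the rank query stay O(log n); the only extra bookkeeping during a rotation is recomputing size for the two nodes it touches.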

Related

How to find nodes fast in an unordered tree

I have an unordered tree in the form of, for example:
Root
A1
A1_1
A1_1_1
A1_1_2
A1_1_2_1
A1_1_2_2
A1_1_2_3
A1_1_3
A1_1_n
A1_2
A1_3
A1_n
A2
A2_1
A2_2
A2_3
A2_n
The tree is unordered.
Each node can have an arbitrary number N of children.
Each node stores a unique long value.
The value I need can be at any position.
My problem: if I need the long value of A1_1_2_3, the first time I traverse the tree I do a depth-first search to find it. However, on later calls for the same node I must get its value without a recursive search. Why? If the tree had hundreds of thousands of nodes before reaching my A1_1_2_3 node, the search would take too much time.
What I thought of is to leave some pointers behind after the first traversal. E.g. in my case, when I return the long value for A1_1_2_3 I also return an array with information for future searches of the same node, saying: to get to A1_1_2_3, I need:
first child of Root, which is A1
first child of A1, which is A1_1
second child of A1_1, which is A1_1_2
third child of A1_1_2, which is what I need: A1_1_2_3
So I figured I would store this information along with the value for A1_1_2_3 as an array of indexes: [0, 0, 1, 2]. By doing so, I could quickly re-locate the node on subsequent calls for A1_1_2_3 and avoid a recursive search each time.
However, the tree can change. On subsequent calls I might be handed a new structure, so the indexes I stored earlier would no longer match. If this happens, I thought that whenever I no longer find the element at the stored position, I would go back up one level and search for the item there, and so on, until I find it again and store fresh indexes for future reference:
e.g. if my A1_1_2_3 is now situated in this new structure:
A1_1
  A1_1_0
  A1_1_1
  A1_1_2
    A1_1_2_1
    A1_1_2_2
    A1_1_21_22
    A1_1_2_3
... in this case the new element A1_1_0 ruined my stored structure, so I would go back up a level and search children again recursively until I find it again.
Does what I thought of here even make sense, or am I overcomplicating things? I'm talking about an unordered tree which can have at most about three hundred thousand nodes, and it is vital that I can jump to nodes as fast as possible. But the tree can also be very small, under 10 nodes.
Is there a more efficient way to search in such a situation?
Thank you for any idea.
edit:
I forgot to add: what I need on subsequent calls is not just the same value; its position also matters, because I must get the next page of children after that child (since it's a tree structure, I'm paging through the nodes that follow the initially selected one). Hope it makes more sense now.
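For what it's worth, here is a minimal sketch in Python of the cached-index-path idea described in the question (TreeNode, resolve and find_with_hint are illustrative names, not from the actual code). The cached path is the list of child indexes; when it goes stale, the lookup climbs back up one ancestor at a time and re-searches from there before falling back to a full search.

class TreeNode:
    def __init__(self, value, children=None):
        self.value = value
        self.children = children if children is not None else []

def resolve(root, path):
    # Follow a list of child indexes; return None if the path no longer exists.
    node = root
    for i in path:
        if i >= len(node.children):
            return None
        node = node.children[i]
    return node

def dfs(node, value, prefix):
    # Plain depth-first search that also rebuilds the index path as it goes.
    if node.value == value:
        return node, prefix
    for i, child in enumerate(node.children):
        found = dfs(child, value, prefix + [i])
        if found is not None:
            return found
    return None

def find_with_hint(root, value, path):
    # Try the cached path first; if it is stale, climb up one level at a
    # time and re-search only that subtree before falling back to the root.
    node = resolve(root, path)
    if node is not None and node.value == value:
        return node, path
    for cut in range(len(path) - 1, 0, -1):
        ancestor = resolve(root, path[:cut])
        if ancestor is not None:
            found = dfs(ancestor, value, list(path[:cut]))
            if found is not None:
                return found          # node plus a fresh path to cache
    return dfs(root, value, [])

The returned path also gives the node's position among its siblings, which is what the paging mentioned in the edit needs.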

A red black tree with the same key multiple times: store collections in the nodes or store them as multiple nodes?

Apparently you could do either, but the former is more common.
Why would you choose the latter and how does it work?
I read this: http://www.drdobbs.com/cpp/stls-red-black-trees/184410531; which made me think that they did it. It says:
insert_always is a status variable that tells rb_tree whether multiple instances of the same key value are allowed. This variable is set by the constructor and is used by the STL to distinguish between set and multiset and between map and multimap. set and map can only have one occurrence of a particular key, whereas multiset and multimap can have multiple occurrences.
Although now I think it doesn't necessarily mean that. They might still be using containers.
I'm thinking all the nodes with the same key would have to be in a row, because you either have to store all nodes with the same key on the right side or the left side. So if you store equal nodes to the right and insert 1000 1s and one 2, you'd basically have a linked list, which would ruin the properties of the red black tree.
Is the reason I can't find much on it that it's just a bad idea?
Downsides of storing duplicates as multiple nodes:
It expands the tree size, which makes searches slower.
If you want to retrieve all values for key K, you need O(M log N) time, where N is the total number of nodes and M is the number of values for key K, unless you introduce extra code (which complicates the data structure) to link the duplicates together. (If you store a collection in the node, retrieval takes only O(log N), and it's simple to implement.)
It's more costly to delete: with the multi-node method you have to remove a node on every delete, whereas with collection storage you only remove the node for key K when the last value for key K is deleted.
I can't think of any upside to the multi-node method. (A sketch of the collection-in-node approach follows below.)
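For illustration, a minimal sketch of the collection-in-node alternative, written as a plain unbalanced BST in Python for brevity; a red-black version would add the usual recoloring and rotations, but the duplicate handling would be the same. All names are illustrative.

class Node:
    def __init__(self, key, value):
        self.key = key
        self.values = [value]      # every value sharing this key lives here
        self.left = None
        self.right = None

def insert(node, key, value):
    if node is None:
        return Node(key, value)
    if key == node.key:
        node.values.append(value)  # duplicate key: no new tree node
    elif key < node.key:
        node.left = insert(node.left, key, value)
    else:
        node.right = insert(node.right, key, value)
    return node

def find_all(node, key):
    # One descent, O(log n) in a balanced tree, however many duplicates exist.
    while node is not None:
        if key == node.key:
            return node.values
        node = node.left if key < node.key else node.right
    return []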
Binary search trees by definition cannot contain duplicates. If you use one to produce a sorted list, throwing out the duplicates would produce an incorrect result.
I was working on an implementation of red-black trees in PHP when I ran into the duplicate issue. We are going to use the tree for sorting and searching.
I am considering adding an occurrence value to the node data type. When a duplicate is encountered, just increment occurrence. When walking the tree to produce output, just repeat the value by the number of occurrences. I think I would still have a valid BST, and it avoids a whole chain of duplicate values, which preserves the optimal search time.
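A minimal sketch of that occurrence-count idea, shown as a plain unbalanced BST in Python rather than PHP, with illustrative names; in the real red-black tree the balancing code is unaffected, since duplicates never add nodes.

class Node:
    def __init__(self, key):
        self.key = key
        self.count = 1          # number of times this key was inserted
        self.left = None
        self.right = None

def insert(node, key):
    if node is None:
        return Node(key)
    if key == node.key:
        node.count += 1         # duplicate: bump the counter, no new node
    elif key < node.key:
        node.left = insert(node.left, key)
    else:
        node.right = insert(node.right, key)
    return node

def in_order(node, out):
    # Emit each key as many times as it was inserted.
    if node is not None:
        in_order(node.left, out)
        out.extend([node.key] * node.count)
        in_order(node.right, out)
    return out

A call like in_order(root, []) then yields the fully sorted output including repeats.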

Random variable-length encoded numbers with uniform distribution

Suppose I have data presented with a variable-length encoding, where I can retrieve an item by parsing some virtual binary tree and stopping when I reach the item (similar to Huffman encoding). There is an unknown number of items (in the best case only the upper limit is known). Is there an algorithm to generate uniformly distributed choices? The problem is that a coin-based algorithm gives a non-uniform result in this case: for example, if one number is encoded as 101 and another is encoded as 10010101, the latter will appear very rarely compared to the former.
UPDATE: In other words, I have a set of at most N elements (but maybe fewer), where every element can be addressed with an arbitrary number of bits (and, in accordance with information theory, the codes are prefix-free: if one element is encoded 101, no other element's code starts with 101). So it's like a binary tree where I go left or right depending on a bit and at some point reach the data item. I want to get a sequence of random elements addressed with this technique, but their distribution should be uniform (the example of why choosing left/right at random won't work is above, with the codes 101 and 10010101).
Thanks
Max
I can think of three basic methods, one of which involves frequent reguessing and one of which involves keeping extra information. I think that doing one or the other of these things is unavoidable. I'm going to begin with the extra information one:
In each node, store a number count which represents the number of leaves in its subtree (so a leaf's count is 1 and the root's count is the total number of items). For each selection you'll need a random number between 1 and root.count, and at every node you decide whether to go left or right by comparing it to the left child's count. Here's the algorithm:
n := random integer between 1 and root.count
node := root
while node.count != 1
    if n <= node.left.count
        node := node.left
    else
        n := n - node.left.count
        node := node.right
So, essentially, we're imposing a left-to-right ordering on the leaves and selecting the nth one from the left. This is fairly quick, taking only O(depth of tree) time, which is likely the best we can do without doing something like also building a vector which contains all the node labels. It also adds an overhead of O(depth of tree) to any change to the tree, since counts must be corrected. If you're going the other way, never changing the tree at all but selecting random nodes a lot, just bite the bullet and put all of the node labels in a vector. That way you can select a random one in O(1) after O(N) initial set-up time.
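A minimal sketch of this first method in Python, assuming the encoding tree is built from explicit nodes (all names illustrative). Counts are computed once here; an updatable tree would instead adjust them along the path of each insertion or deletion.

import random

class Node:
    def __init__(self, left=None, right=None, item=None):
        self.left, self.right, self.item = left, right, item
        self.count = 0          # number of leaves in this subtree

def is_leaf(node):
    return node.left is None and node.right is None

def init_counts(node):
    # One O(N) pass; incremental updates would be O(depth) per change.
    if node is None:
        return 0
    node.count = 1 if is_leaf(node) else init_counts(node.left) + init_counts(node.right)
    return node.count

def random_leaf(root):
    # Pick the nth leaf from the left, with n uniform in 1..root.count.
    n = random.randint(1, root.count)
    node = root
    while not is_leaf(node):
        left_count = node.left.count if node.left else 0
        if n <= left_count:
            node = node.left
        else:
            n -= left_count
            node = node.right
    return node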
If, however, you don't want to use up any storage space, here's an alternative with a lot of reguessing. First find a bound (which I'll label B) for the depth of the tree (we can use N-1 if needed, but obviously that's a very loose bound; the tighter the bound, the faster the algorithm runs). Next we're going to generate a possible node label in a random but even way. There are 2^(B+1)-1 possibilities. It's not just 2^B because, for example, the strings "0011" and "11" are completely different strings. As a result, we need to count all possible binary strings of length between 0 and B. Obviously, we have 2^i strings of length i, so for strings of length i or less we have sum(i=0 to B){2^i} = 2^(B+1)-1. So we can just choose a number between 0 and 2^(B+1)-2 and then find the corresponding node label. Of course, the mapping from numbers to node labels isn't trivial, so I'll provide it here.
We convert the number we have chosen into a string of bits in the ordinary way. Then, reading from the left, if the first digit is a 0, then the node label is the remaining string to the right (possibly the empty string, which is a valid node label although not likely to be in use). If the first digit is a 1, then we throw it away and repeat this process. Thus, if B=4, then the node label "0001" would come from the number "00001". The node label "001" would come from the number "10001". The node label "01" would come from the number "11001". The node label "1" would come from the number "11101". And the node label "" would come from the number "11110". We did not include the number 2^(B+1)-1 ("11111" in this case) which has no valid interpretation under this scheme. I'll leave it as an exercise to the reader to prove to themselves that every string from length 0 to B can be represented under this scheme. Rather than trying to prove it, I'll just assert that it will work.
So now we have a node label. The next step is to see if that label exists by traversing the tree. If it does, we're done. If it doesn't, then choose a new number and start over (that's the reguessing part). It's likely to have to reguess a lot, since only a small fraction of legal node labels will be in use, but this won't skew the fairness, just increase the time.
Here's a pseudo-code version of this process in four functions:
function num_to_binary_list(n, digits) =
    if digits == 0 return ()
    if n mod 2 == 0 return 0 :: num_to_binary_list(n/2, digits-1)
    else return 1 :: num_to_binary_list((n-1)/2, digits-1)

function binary_list_to_node_label_list(l) =
    if l.head() == 0 return l.tail()
    else return binary_list_to_node_label_list(l.tail())

function check_node_label_list_against_tree(str, node) =
    if node == null return false, null
    if str.isEmpty()
        if node.isLeaf() return true, node
        else return false, null
    if str.head() == 0 return check_node_label_list_against_tree(str.tail(), node.left)
    else return check_node_label_list_against_tree(str.tail(), node.right)

function generate_random_node(tree, b) =
    found := false
    while (not found)
        x := random(0, 2**(b+1)-2) // We're assuming that this random selects inclusively
        node_label := binary_list_to_node_label_list(num_to_binary_list(x, b+1))
        found, node := check_node_label_list_against_tree(node_label, tree)
    return node
The timing analysis for this, of course, is pretty horrendous. Basically, the while loop will run an average of (2^(B+1)-1)/N times. So, in the worst case, it's O((2^N)/N) which is terrible. In the best case, B would be on the order of log(N), so it would be roughly O(1), but that requires that the tree be fairly balanced which it may not be. Still, if you really want no extra space, this method does that.
I don't really think that you can do better than this last method without storing some information. It sounds appealing to be able to traverse the tree, making random decisions as you go, but without storing additional information about the structure, you're just not going to be able to do that. Every time you make a branching decision, you could have just one node on the left side and a million nodes on the right side or it could have a million nodes on the left side and just one on the right side. Because those are both possible and you don't know which is the case, there's simply no way to make an even random decision between the two sides. Obviously 50-50 doesn't work and any other choice is going to be similarly problematic.
So, if you don't want extra space, the second method will work, but be slow. If you don't mind adding some extra space, the first method will work and be fast. And, as I said earlier, if you're not going to be changing the tree and you'll be selecting a lot of random nodes, then bite the bullet and just traverse the tree and stick all leaf nodes in a self-growing array or vector and then pick from that.
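For completeness, the vector option mentioned above is short enough to show; this assumes nodes with left/right fields as in the earlier sketch (illustrative names).

import random

def all_leaves(node, acc=None):
    # One O(N) pass up front; every selection afterwards is O(1).
    acc = [] if acc is None else acc
    if node.left is None and node.right is None:
        acc.append(node)
    else:
        if node.left is not None:
            all_leaves(node.left, acc)
        if node.right is not None:
            all_leaves(node.right, acc)
    return acc

# leaves = all_leaves(root)
# chosen = random.choice(leaves)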

Finding the width of a directed acyclic graph... with only the ability to find parents

I'm trying to find the width of a directed acyclic graph... as represented by an arbitrarily ordered list of nodes, without even an adjacency list.
The graph/list is for a parallel GNU Make-like workflow manager that uses files as its criteria for execution order. Each node has a list of source files and target files. We have a hash table in place so that, given a file name, the node which produces it can be determined. In this way, we can figure out a node's parents by examining the nodes which generate each of its source files using this table.
That is the ONLY ability I have at this point, without changing the code severely. The code has been in public use for a while, and the last thing we want to do is to change the structure significantly and have a bad release. And no, we don't have time to test rigorously (I am in an academic environment). Ideally we're hoping we can do this without doing anything more dangerous than adding fields to the node.
I'll be posting a community-wiki answer outlining my current approach and its flaws. If anyone wants to edit that, or use it as a starting point, feel free. If there's anything I can do to clarify things, I can answer questions or post code if needed.
Thanks!
EDIT: For anyone who cares, this will be in C. Yes, I know my pseudocode is in some horribly botched Python look-alike. I'm sort of hoping the language doesn't really matter.
I think the "width" you're considering here isn't really what you want - the width depends on how you assign levels to each node where you have some choice. You noticed this when you were deciding whether to assign all sources to level 0 or all sinks to the max level.
Instead, you just want to count the number of nodes and divide by the "critical path length", which is the longest path in the dag. This gives the average parallelism for the graph. It depends only on the graph itself, and it still gives you an indication of how wide the graph is.
To compute the critical path length, just do what you're doing - the critical path length is the maximum level you end up assigning.
In my opinion, when you're doing this type of last-minute development, it's best to keep the new structures separate from the ones you are already using. At this point, if I were pressed for time, I would go for a simpler solution:
Create an adjacency matrix for the graph using the parent data (should be easy).
Perform a topological sort using this matrix (or even use tsort if pressed for time).
Now that you have a topological sort, create an array level, with one element for each node.
For each node, in topological order:
If the node has no parents, set its level to 0.
Otherwise, set its level to the minimum of its parents' levels plus 1.
Find the level with the most nodes; its size is the width. (See the sketch after this answer.)
The question, as Keith Randall asked, is whether this is the right measurement for what you need.
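A minimal sketch of those steps in Python (the project's code is C; nodes and parents_of are hypothetical stand-ins for whatever the hash-table lookup provides). It uses adjacency lists rather than a matrix for brevity and keeps the min-of-parents rule exactly as stated in the steps above.

from collections import defaultdict, deque

def dag_width(nodes, parents_of):
    # nodes: node ids; parents_of(n): the nodes producing n's source files.
    nodes = list(nodes)
    children = defaultdict(list)
    indegree = {n: 0 for n in nodes}
    for n in nodes:
        for p in parents_of(n):
            children[p].append(n)
            indegree[n] += 1

    # Kahn-style topological sort.
    order = []
    queue = deque(n for n in nodes if indegree[n] == 0)
    while queue:
        n = queue.popleft()
        order.append(n)
        for c in children[n]:
            indegree[c] -= 1
            if indegree[c] == 0:
                queue.append(c)

    # Level assignment as in the steps above (minimum parent level plus 1;
    # using max instead would give the usual longest-path layering).
    level = {}
    for n in order:
        ps = list(parents_of(n))
        level[n] = 0 if not ps else min(level[p] for p in ps) + 1

    # Width = size of the most populated level.
    counts = defaultdict(int)
    for n in order:
        counts[level[n]] += 1
    return max(counts.values()) if counts else 0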
Here's what I (Platinum Azure, the original author) have so far.
Preparations/augmentations:
Add "children" field to linked list ("DAG") node
Add "level" field to "DAG" node
Add "children_left" field to "DAG" node. This is used to make sure that all children are examined before a parent is examined (in a later stage of the algorithm).
Algorithm:
Find the number of immediate children for all nodes; also, determine leaves by adding nodes with children==0 to list.
for l in L:
    l.children = 0
for l in L:
    l.level = 0
    for p in l.parents:
        ++p.children
Leaves = []
for l in L:
    l.children_left = l.children
    if l.children == 0:
        Leaves.append(l)
Assign every node a "reverse depth" level. Normally by depth, I mean topologically sort and assign depth=0 to nodes with no parents. However, I'm thinking I need to reverse this, with depth=0 corresponding to leaves. Also, we want to make sure that no node is added to the queue without all its children "looking at it" first (to determine its proper "depth level").
max_level = 0
while !Leaves.empty():
    l = Leaves.pop()
    for p in l.parents:
        --p.children_left
        if p.children_left == 0:
            /* we only want to append parents with for sure correct level */
            Leaves.append(p)
        p.level = Max(p.level, l.level + 1)
        if p.level > max_level:
            max_level = p.level
Now that every node has a level, simply create an array and then go through the list once more to count the number of nodes in each level.
level_count = new int[max_level+1]
for l in L:
    ++level_count[l.level]
width = Max(level_count)
So that's what I'm thinking so far. Is there a way to improve on it? It's linear time all the way, but it's got like five or six linear scans and there will probably be a lot of cache misses and the like. I have to wonder if there isn't a way to exploit some locality with a better data structure-- without actually changing the underlying code beyond node augmentation.
Any thoughts?

A data structure based on the R-Tree: creating new child nodes when a node is full, but what if I have a lot of objects at the exact same position?

I realize my title is not very clear, but I am having trouble thinking of a better one. If anyone wants to correct it, please do.
I'm developing a data structure for my 2 dimensional game with an infinite universe. The data structure is based on a simple (!) node/leaf system, like the R-Tree.
This is the basic concept: you set the maximum number of children you want a node (a container) to have. If you want to add a leaf, but the node the leaf should go in is full, then it creates a new set of nodes within this node and moves all current leaves into their new (more exact) nodes. This way, very populated areas get a lot more subdivisions than a very big but rarely visited area.
This works for normal objects. The only problem arises when I have more than maxChildsPerNode objects with the exact same X,Y location: because the node is full, it will create more exact subnodes, but the old leaves will all be put in the exact same node again because they have the exact same position, resulting in an infinite loop of creating more and more nodes.
So, what should I do when I want to add more leafs than maxChildsPerNode with the exact same position to my tree?
PS. if I failed to explain my problem, please tell me, so I can try to improve the explanation.
Update: this is how I check whether all leaves in a full node have identical positions:
//returns whether all leafs in the given leaf list are identical
private function allLeafsHaveSamePos(leafArr:Array<RTLeaf<T>>):Bool {
    if (leafArr.length > 1) {
        var lastLeafTopX:Float = leafArr[0].topX;
        var lastLeafTopY:Float = leafArr[0].topY;
        for (i in 1...leafArr.length) {
            if ((leafArr[i].topX != lastLeafTopX) || (leafArr[i].topY != lastLeafTopY)) return false;
        }
    }
    return true;
}
I would like to ask a question...
is it that important that the maxChildsPerNode constraint be respected?
I would rather think of this maximum as a hint to the structure for when to split, and simply ignore it when there is no meaningful way to perform the split.
Of course you'd better rethink the name then, otherwise it'd be odd for the next maintainer.
In pseudo code I would use something like this:
def AddToNode(node, item):
    node.items.append(item)
    if len(node.items) > node.splitHint:
        leftNode = Node(node.splitHint)
        rightNode = Node(node.splitHint)
        node.split(leftNode, rightNode)
        if len(leftNode.items) == 0 or len(rightNode.items) == 0:
            node.splitHint *= 1.5 # famous golden ratio ;)
        else:
            node.items = [leftNode, rightNode]
The important step is to modify the hint when we detect that we can't abide by it, so that we don't repeat the failed split attempt at each insertion (this way we obtain a constant amortized cost).
It looks like a bit of a mismatch between your data and your structure, since you have a structure that assumes N objects within an arbitrarily large area when you're supplying it with >N objects on an infinitely small point. It might be worth using a different structure for this data?
Hack fix: apply a tiny random displacement to your newly created objects. This should allow the space to be subdivided by the existing algorithm.
Better fix: ensure that your algorithm for splitting a leaf node always generates at least 2 new leaf nodes to begin with. When reassigning objects to the new leaf nodes, or when performing normal insertions, iterate over all the candidates, and if more than one is equally suitable then you can tie-break based on how full they are. This should result in your co-located players ending up split evenly across the otherwise identical nodes.
From common sense I'd not assume having two objects in the same position ever, but if this is a part of the idea, then I would introduce one more axis, say 'spin', an integer number, and impose a restriction that all your objects are fermions (cannot have the same location and spin at the same time).
If you have a set of objects on the exact same spot, any query for a region that contains one should return all - so there's no reason to split them, as it doesn't gain you anything. Simply either count the number of distinct locations when deciding to split, or have each element on the leaf node be an object that encapsulates (coordinates, [list of objects at those coordinates]).
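A minimal sketch of that last suggestion, with hypothetical names and in Python rather than Haxe: each leaf entry becomes a bucket for one distinct coordinate, so co-located objects can never force a split.

class Bucket:
    # One leaf entry per distinct position; objects at that spot share it.
    def __init__(self, x, y):
        self.x, self.y = x, y
        self.objects = []

class Node:
    def __init__(self, max_children):
        self.max_children = max_children
        self.buckets = []

    def add(self, x, y, obj):
        for b in self.buckets:
            if b.x == x and b.y == y:
                b.objects.append(obj)     # same spot: no new entry, no split
                return
        bucket = Bucket(x, y)
        bucket.objects.append(obj)
        self.buckets.append(bucket)
        if len(self.buckets) > self.max_children:
            self.subdivide()              # now guaranteed more than one distinct position

    def subdivide(self):
        # The existing R-tree-style split logic would go here; since every
        # bucket has a distinct coordinate, the infinite-split loop cannot occur.
        pass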
