I am trying to solve a problem where a new joining peer is given an index (0, 1, 2, ..., n-1) based on how many peer objects already exist (e.g. if 8 exist, the new peer gets index 8).
I want to add these peer objects into a binary tree based on their index. For example, peer 0 joins and it will be the root, then peer 1 & peer 2 will be peer 0's left and right children.
I only need the binary tree to follow the rule that each node has at most two children.
Here's an example:
      0
     / \
    1   2
   / \ / \
  3  4 5  6
My problem is that I am unsure how to actually do this insertion while keeping the two-children rule. At first I assumed a normal BST insertion rule would work, but once I actually coded it up, I ran into the problem of the pivot/key: since I am inserting based on the index, everything just becomes a right child.
I am really stuck on this, but I think the solution is a trivial one that I am just unable to see. Any advice?
Edit:
Thank you for the help!
I think I figured out something that will meet my needs, so I'll leave it here. I will have an implicit binary tree structure. Peers that join will be put into a priority queue based on their index; this signifies whether they can have children assigned to them, and a peer is removed from this queue once it has two children.
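A minimal sketch of that approach (names are mine, not from any actual code). Since peers join in index order, a plain FIFO queue behaves exactly like the index-ordered priority queue described above:

from collections import deque

class Peer:
    def __init__(self, index):
        self.index = index
        self.children = []

open_peers = deque()              # peers that still have a free child slot

def join(peer):
    if open_peers:
        parent = open_peers[0]    # lowest-index peer with a free slot
        parent.children.append(peer)
        if len(parent.children) == 2:
            open_peers.popleft()  # parent is full; retire it from the queue
    open_peers.append(peer)       # the new peer can accept children itself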
A few things that you want to consider:
Why do you want a BST?
BSTs are, as the name implies, primarily for searching. But if you are assigning every new user that joins a unique identifier, then you don't need to use a BST to search for them, because you can access them from an array by index.
A BST would be more useful if, say, each user played a game and earned a particular score. To organize the users in a data structure that would render them easily searchable/organizable by score, you might insert a player into the BST with their score as the key when they finished the game. But for unique identifiers like this, there is no reason to use a BST. In fact, the data structure that you show there is not a BST. The BST would look like this:
      3
     / \
    1   5
   / \ / \
  0  2 4  6
Is another data structure more appropriate?
If you have gotten a better sense of why a BST is not a useful structure for organizing user IDs, then you should next think about what it is you are actually trying to do. If you just want to store all the users in a data structure, a list (array) is totally fine, where the index of the list corresponds to the user ID.
If you are instead looking to add some sort of grouping to these users, consider using a hash table. For example, if you wanted to be able to look up a user's friends, you would create a hash table where a user id (key) maps to a list of friends' user ids (value).
Hopefully this has been helpful. If there is any more I can do to help, or if I have not fully understood what you are trying to accomplish, just let me know.
UPDATE
So based on the comments above, it seems your confusion is about the distinction between a binary tree and a BST. A binary tree is any tree where each node has at most two children, whereas a binary search tree imposes additional constraints on the values of the nodes' keys. The binary tree structure is what you want; you don't need it for searching, nor do you want to compare key values.
For any given index i, the parent node has index ((i + 1) >> 1) - 1, and the children have indices (i << 1) + 1 and (i + 1) << 1. I don't know if this is of any help, as I'm unsure of the purpose of your endeavour, but it means you could save all your peers in a plain array and use that plain array as a binary tree structure, reaching a node's children (and parent) through index arithmetic alone.
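For instance, a rough sketch (identifier names are mine) of keeping the peers in a plain Python list and treating it as a 0-indexed binary tree:

peers = []                        # peers[i] is the peer with index i

def add_peer(peer):
    peers.append(peer)            # the new peer gets index len(peers) - 1

def parent(i):
    return ((i + 1) >> 1) - 1 if i > 0 else None    # same as (i - 1) // 2

def left_child(i):
    c = (i << 1) + 1              # 2*i + 1
    return c if c < len(peers) else None

def right_child(i):
    c = (i + 1) << 1              # 2*i + 2
    return c if c < len(peers) else None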
Related
Apparently you could do either, but the former is more common.
Why would you choose the latter and how does it work?
I read this: http://www.drdobbs.com/cpp/stls-red-black-trees/184410531, which made me think that they did it. It says:
insert_always is a status variable that tells rb_tree whether multiple instances of the same key value are allowed. This variable is set by the constructor and is used by the STL to distinguish between set and multiset and between map and multimap. set and map can only have one occurrence of a particular key, whereas multiset and multimap can have multiple occurrences.
Although now I think it doesn't necessarily mean that; they might still be using containers.
I'm thinking all the nodes with the same key would have to be in a row, because you either have to store all nodes with the same key on the right side or the left side. So if you store equal nodes to the right and insert 1000 1s and one 2, you'd basically have a linked list, which would ruin the properties of the red black tree.
Is the reason why i can't find much on it that it's just a bad idea?
Downsides of storing duplicates as multiple nodes:
It expands the tree size, which makes searches slower.
If you want to retrieve all values for key K, you need M*log(N) time, where N is the total number of nodes and M is the number of values for key K, unless you introduce extra code (which complicates the data structure) to link those values together. (If you store a collection per key, retrieval only takes log(N) time, and it's simple to implement.)
Deletion is more costly: with the multi-node method you need to remove a node on every delete, but with collection storage you only remove the node for key K when the last value for key K is deleted.
I can't think of any upside to the multi-node method.
Binary search trees by definition cannot contain duplicates. If you use one to produce a sorted list, throwing out the duplicates would produce an incorrect result.
I was working on an implementation of red-black trees in PHP when I ran into the duplicate issue. We are going to use the tree for sorting and searching.
I am considering adding an occurrence value to the node data type. When a duplicate is encountered, just increment the occurrence count. When walking the tree to produce output, repeat each value by its number of occurrences. I think I would still have a valid BST, and I would avoid a whole chain of duplicate values, which preserves the optimal search time.
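A rough sketch of that occurrence-counter idea (shown on a plain, unbalanced BST for brevity; a real red-black tree would also rebalance after inserting a genuinely new key):

class Node:
    def __init__(self, key):
        self.key = key
        self.count = 1            # number of occurrences of this key
        self.left = None
        self.right = None

def insert(node, key):
    if node is None:
        return Node(key)
    if key == node.key:
        node.count += 1           # duplicate: bump the counter, no new node
    elif key < node.key:
        node.left = insert(node.left, key)
    else:
        node.right = insert(node.right, key)
    return node

def walk(node, out):              # in-order walk repeats each key count times
    if node is not None:
        walk(node.left, out)
        out.extend([node.key] * node.count)
        walk(node.right, out)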
I have a set of items that are supposed to form a balanced binary tree. Each item is of the form (data, parent): data is the useful information and parent is the index of the parent node in the binary tree.
Nodes in the tree are numbered left-to-right, row-by-row, like this:
        1
    ___/ \___
   /         \
  2           3
 _/\_       _/\_
4    5     6    7
These elements come stored in a linked list. How should I order this list such that it's easier for me to build the tree? Each parent node will be referenced (by index) by exactly two child nodes; if I sort these by parent index, the sorting must be stable.
You can sort the list with any stable sort, according to the parent field, in increasing order.
The result will be a list like this:
[(d_1, nil), (d_2, 1), (d_3, 1), (d_4, 2), (d_5, 2), ..., (d_i, x), (d_i+1, x)]
     ^
     the root has no parent
Note that in this list, since we used a stable sort, for each two pairs (d_i, x), (d_i+1, x) in the sorted list, d_i is the left child.
Now you can populate the tree with a breadth-first traversal.
Since it is homework, I still want you to make sure you understand everything on your own, so I do not want to just feed you the answer. If you have any specific question, please comment and I will try to edit and explain the relevant parts in more detail.
Bonus: the result of this organization is a very common way to implement a binary heap, which is a complete binary tree; for performance, we usually store it as an array, which is very similar to the output generated by this approach.
I don't think I understand exactly what you are trying to achieve. You have to write the function that inserts items into the tree. A red-black tree, for example, has the same complexity for insertions, O(log n), no matter how the input data is sorted. Is there a specific implementation you have to use, or a specific speed target you must reach for inserts?
PS: Sounds like homework to me :)
It sounds like you want a binary tree that allows you to go from a leaf node to its ancestors, using an array.
Usually, sorting a list before putting it into a binary search tree produces an unbalanced tree, unless you use a treap or another self-balancing O(log n) data structure.
The usual way of stashing a (complete) binary tree in an array, is to make node i have two children 2i and 2i+1.
Given this organization (not a sorting, but an organization), you can go from any node to its parent by dividing the array index by 2 using integer arithmetic, which truncates fractions.
If your binary trees are not always complete, you'll probably be better served by forgetting about the array and instead using a more traditional tree structure with pointers/references.
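As a small illustration of that 1-indexed convention (a sketch, not tied to any particular codebase):

def parent(i):
    return i // 2                 # integer division truncates: parent(7) == 3

def children(i):
    return 2 * i, 2 * i + 1       # left and right children of node i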
I'm trying to find the width of a directed acyclic graph... as represented by an arbitrarily ordered list of nodes, without even an adjacency list.
The graph/list is for a parallel GNU Make-like workflow manager that uses files as its criteria for execution order. Each node has a list of source files and target files. We have a hash table in place so that, given a file name, the node which produces it can be determined. In this way, we can figure out a node's parents by examining the nodes which generate each of its source files using this table.
That is the ONLY ability I have at this point, without changing the code severely. The code has been in public use for a while, and the last thing we want to do is to change the structure significantly and have a bad release. And no, we don't have time to test rigorously (I am in an academic environment). Ideally we're hoping we can do this without doing anything more dangerous than adding fields to the node.
I'll be posting a community-wiki answer outlining my current approach and its flaws. If anyone wants to edit that, or use it as a starting point, feel free. If there's anything I can do to clarify things, I can answer questions or post code if needed.
Thanks!
EDIT: For anyone who cares, this will be in C. Yes, I know my pseudocode is in some horribly botched Python look-alike. I'm sort of hoping the language doesn't really matter.
I think the "width" you're considering here isn't really what you want - the width depends on how you assign levels to each node where you have some choice. You noticed this when you were deciding whether to assign all sources to level 0 or all sinks to the max level.
Instead, you just want to count the number of nodes and divide by the "critical path length", which is the longest path in the DAG. This gives the average parallelism for the graph. It depends only on the graph itself, and it still gives you an indication of how wide the graph is.
To compute the critical path length, just do what you're doing: the critical path length is the maximum level you end up assigning.
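As a two-line sketch (assuming, as in the pseudocode further down, that max_level is the highest level assigned and L is the node list):

critical_path_length = max_level + 1           # levels are numbered from 0
average_parallelism = len(L) / critical_path_length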
In my opinion, when you're doing this type of last-minute development, it's best to keep the new structures separate from the ones you are already using. At this point, if I were pressed for time, I would go for a simpler solution.
Create an adjacency matrix for the graph using the parent data (should be easy)
Perform a topological sort using this matrix. (or even use tsort if pressed for time)
Now that you have a topological sort, create an array level, one element for each node.
For each node:
If the node has no parents, set its level to 0.
Otherwise, set it to the maximum of its parents' levels plus 1 (the maximum, so that every node lands strictly below all of its parents).
Find the maximum level width.
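A rough sketch of these steps in Python (illustrative names; for brevity it works directly from each node's parent links rather than building an explicit adjacency matrix):

from collections import deque, Counter

def level_width(nodes):                        # each node has a .parents list
    children = {n: [] for n in nodes}
    missing = {}                               # count of parents not yet leveled
    ready = deque()
    for n in nodes:
        missing[n] = len(n.parents)
        for p in n.parents:
            children[p].append(n)
        if not n.parents:
            ready.append(n)
    level = {}
    while ready:                               # visits nodes in topological order
        n = ready.popleft()
        level[n] = 1 + max((level[p] for p in n.parents), default=-1)
        for c in children[n]:
            missing[c] -= 1
            if missing[c] == 0:
                ready.append(c)
    return max(Counter(level.values()).values())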
The question is, as Keith Randall asked: is this the right measurement for what you need?
Here's what I (Platinum Azure, the original author) have so far.
Preparations/augmentations:
Add "children" field to linked list ("DAG") node
Add "level" field to "DAG" node
Add "children_left" field to "DAG" node. This is used to make sure that all children are examined before a parent is examined (in a later stage of the algorithm).
Algorithm:
Find the number of immediate children for all nodes; also, determine the leaves by adding nodes with children == 0 to a list.
for l in L:
    l.children = 0

for l in L:
    l.level = 0
    for p in l.parents:
        p.children += 1

Leaves = []
for l in L:
    l.children_left = l.children
    if l.children == 0:
        Leaves.append(l)
Assign every node a "reverse depth" level. Normally, by depth I mean: topologically sort and assign depth = 0 to nodes with no parents. However, I'm thinking I need to reverse this, with depth = 0 corresponding to leaves. Also, we want to make sure that no node is added to the queue until all its children have "looked at it" first (to determine its proper depth level).
max_level = 0
while Leaves:
    l = Leaves.pop()
    for p in l.parents:
        p.children_left -= 1
        p.level = max(p.level, l.level + 1)
        if p.level > max_level:
            max_level = p.level
        if p.children_left == 0:
            # we only want to append parents whose level is now final
            Leaves.append(p)
Now that every node has a level, simply create an array and then go through the list once more to count the number of nodes in each level.
level_count = [0] * (max_level + 1)
for l in L:
    level_count[l.level] += 1

width = max(level_count)
So that's what I'm thinking so far. Is there a way to improve on it? It's linear time all the way, but it has five or six linear scans and there will probably be a lot of cache misses and the like. I have to wonder if there isn't a way to exploit some locality with a better data structure, without actually changing the underlying code beyond node augmentation.
Any thoughts?
So, here is my little problem.
Let's say I have a list of buckets a0 ... an which respectively contain c0 ... cn items, where L <= ci < H. I can decide on the L and H limits. I could even update them dynamically, though I don't think it would help much.
The order of the buckets matter. I can't go and swap them around.
Now, I'd like to index these buckets so that:
I know the total count of items
I can look-up the ith element
I can add/remove items from any bucket and update the index efficiently
Seems easy, right? Seeing these criteria, I immediately thought about a Fenwick Tree. That's what they are meant for, really.
However, when you think it through, a few other use cases creep in:
if a bucket count drops below L, the bucket must disappear (don't worry about the items yet)
if a bucket count reaches H, then a new bucket must be created because this one is full
I haven't figured out how to edit a Fenwick Tree efficiently: remove / add a node without rebuilding the whole tree...
Of course we could set L = 0, so that removal would become unnecessary; however, adding items cannot really be avoided.
So here is the question:
Do you know either a better structure for this index or how to update a Fenwick Tree ?
The primary concern is efficiency, and because I do plan to implement it, cache/memory considerations are worth worrying about.
Background:
I am trying to come up with a structure somewhat similar to B-Trees and Ranked Skip Lists, but with a localized index. The problem with those two structures is that the index is kept alongside the data, which is inefficient in terms of cache (i.e. you need to fetch multiple pages from memory). Database implementations suggest that keeping the index isolated from the actual data is more cache-friendly, and thus more efficient.
I have understood your problem as:
Each bucket has an internal order and buckets themselves have an order, so all the elements have some ordering and you need the ith element in that ordering.
To solve that:
What you can do is maintain a 'cumulative value' tree where the leaf nodes (x1, x2, ..., xn) are the bucket sizes. The value of a node is the sum of the values of its immediate children. Keeping n a power of 2 will make it simple (you can always pad with zero-size buckets at the end), and the tree will be a complete tree.
Corresponding to each bucket you will maintain a pointer to the corresponding leaf node.
E.g., say the bucket sizes are 2, 1, 4, 8.
The tree will look like
      15
     /  \
    3    12
   / \   / \
  2   1 4   8
If you want the total count, read the value of the root node.
If you want to modify some xk (i.e. change the corresponding bucket size), you can walk up the tree following parent pointers, updating the values.
For instance if you add 4 items to the second bucket it will be (the nodes marked with * are the ones that changed)
      19*
     /  \
    7*   12
   / \   / \
  2   5* 4   8
If you want to find the ith element, you walk down the tree, effectively doing a binary search: at each node you already have the left and right child counts. If i > the left child's value, you subtract the left child's value from i and recurse into the right subtree. If i <= the left child's value, you go left and recurse there.
Say you wanted to find the 9th element in the (updated) tree above:
The left child of the root is 7, and 7 < 9,
so you subtract 7 from 9 (to get 2) and go right.
Since 2 < 4 (the left child of 12), you go left.
You are now at the leaf node corresponding to the third bucket, and you need to pick the second element in that bucket.
If you have to add a new bucket, you double the size of your tree (if needed) by adding a new root, making the existing tree the left child, and adding a new subtree of all-zero buckets except for the one you added (which will be the leftmost leaf of the new subtree). This is amortized O(1) time per bucket added. The caveat is that you can only add a bucket at the end, not anywhere in the middle.
Getting the total count is O(1).
Updating a single bucket and looking up the ith item are O(log n).
Adding a new bucket is amortized O(1).
Space usage is O(n).
Instead of a binary tree, you can probably do the same with a B-Tree.
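For completeness, a minimal sketch of the cumulative value tree described above (names are mine; the complete binary tree lives in a flat array, with n a power of two; the bucket-append/doubling step is left out for brevity):

class SumTree:
    def __init__(self, sizes, n):              # n = power-of-two leaf capacity
        self.n = n
        self.tree = [0] * (2 * n)
        for i, s in enumerate(sizes):          # leaves live at tree[n:]
            self.tree[n + i] = s
        for i in range(n - 1, 0, -1):          # internal node = sum of children
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]

    def total(self):
        return self.tree[1]                    # the root holds the total count

    def update(self, bucket, delta):           # bucket size changed by delta
        i = self.n + bucket
        while i >= 1:                          # walk up via parent links
            self.tree[i] += delta
            i //= 2

    def find_ith(self, i):                     # 1-based rank -> (bucket, rank in bucket)
        node = 1
        while node < self.n:                   # descend, binary-search style
            left = self.tree[2 * node]
            if i > left:
                i -= left                      # skip the whole left subtree
                node = 2 * node + 1
            else:
                node = 2 * node
        return node - self.n, i

t = SumTree([2, 1, 4, 8], 4)
t.update(1, 4)                                 # add 4 items to the second bucket
assert t.total() == 19
assert t.find_ith(9) == (2, 2)                 # 2nd element of the third bucket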
I still hope for other answers; however, here is what I have come up with so far, following #Moron's suggestion.
Apparently my little Fenwick Tree idea cannot be easily adapted: it's easy to append new buckets at the end of the Fenwick Tree, but not in the middle, so it's kind of a lost cause.
We're left with two data structures: Binary Indexed Trees (ironically, the very name Fenwick used to describe his structure) and Ranked Skip Lists.
Typically, these do not separate the data from the index; however, we can get this behavior by:
Using indirection: the element held by a node is a pointer to a bucket, not the bucket itself.
Using pool allocation, so that the index elements, even though allocated independently from one another, are still close in memory, which should help the cache.
I tend to prefer Skip Lists to Binary Trees because they are self-organizing, so I'm spared the trouble of constantly re-balancing my tree.
These structures would allow getting to the ith element in O(log N); I don't know if it's possible to get better asymptotic performance.
Another interesting implementation detail: I have a pointer to an element, but others might have been inserted/removed in the meantime, so how do I know the rank of my element now?
That's possible if the bucket points back to the node that owns it, but it means the node must either not move, or update the bucket's pointer whenever it is moved around.
I'm trying to find the definition of a binary search tree and I keep finding different definitions everywhere.
Some say that for any given subtree the left child key is less than or equal to the root.
Some say that for any given subtree the right child key is greater than or equal to the root.
And my old college data structures book says "every element has a key and no two elements have the same key."
Is there a universal definition of a BST? Particularly in regard to what to do with trees with multiple instances of the same key.
EDIT: Maybe I was unclear, the definitions I'm seeing are
1) left <= root < right
2) left < root <= right
3) left < root < right, such that no duplicate keys exist.
Many algorithms will specify that duplicates are excluded. For example, the example algorithms in the MIT Algorithms book usually present examples without duplicates. It is fairly trivial to implement duplicates (either as a list at the node, or in one particular direction.)
Most (that I've seen) specify left children as <= and right children as >. Practically speaking, a BST which allows either of the right or left children to be equal to the root node will require extra computational steps to finish a search where duplicate nodes are allowed.
It is best to use a list at the node to store duplicates, as inserting an '=' value to one side of a node requires rewriting the tree on that side to place the node as the child, or else the node is placed as a grand-child at some point below, which eliminates some of the search efficiency.
You have to remember, most of the classroom examples are simplified to portray and deliver the concept. They aren't worth squat in many real-world situations. But the statement, "every element has a key and no two elements have the same key", is not violated by the use of a list at the element node.
So go with what your data structures book said!
Edit:
A universal definition of a binary search tree involves storing and searching for a key by traversing a data structure in one of two directions. In the pragmatic sense, that means if the value is <>, you traverse the data structure in one of two 'directions'. So, in that sense, duplicate values don't make any sense at all.
This is different from BSP (binary space partitioning), but not all that different. The search algorithm has one of two directions for 'travel', or it is done (successfully or not). So I apologize that my original answer didn't address the concept of a 'universal definition', as duplicates are really a distinct topic (something you deal with after a successful search, not as part of the binary search itself).
If your binary search tree is a red-black tree, or you intend to do any kind of "tree rotation" operations, duplicate nodes will cause problems. Imagine your tree rule is this:
left < root <= right
Now imagine a simple tree whose root is 5, left child is nil, and right child is 5. If you do a left rotation on the root you end up with a 5 in the left child and a 5 in the root with the right child being nil. Now something in the left tree is equal to the root, but your rule above assumed left < root.
I spent hours trying to figure out why my red/black trees would occasionally traverse out of order, the problem was what I described above. Hopefully somebody reads this and saves themselves hours of debugging in the future!
All three definitions are acceptable and correct. They define different variations of a BST.
Your college data structures book failed to clarify that its definition was not the only possible one.
Certainly, allowing duplicates adds complexity. If you use the definition "left <= root < right" and you have a tree like:
  3
 / \
2   4
then adding a "3" duplicate key to this tree will result in:
  3
 / \
2   4
 \
  3
Note that the duplicates are not in contiguous levels.
This is a big issue when allowing duplicates in a BST representation like the one above: duplicates may be separated by any number of levels, so checking for a duplicate's existence is not as simple as checking the immediate children of a node.
An option to avoid this issue is to not represent duplicates structurally (as separate nodes) but instead use a counter that counts the number of occurrences of the key. The previous example would then have a tree like:
    3(1)
   /    \
 2(1)   4(1)
and after insertion of the duplicate "3" key it will become:
    3(2)
   /    \
 2(1)   4(1)
This simplifies lookup, removal and insertion operations, at the expense of some extra bytes and counter operations.
In a BST, all values descending on the left side of a node are less than (or equal to, see later) the node itself. Similarly, all values descending on the right side of a node are greater than (or equal to) that node value(a).
Some BSTs may choose to allow duplicate values, hence the "or equal to" qualifiers above. The following example may clarify:
        14
       /  \
     13    22
     /    /  \
    1   16    29
             /  \
           28    29
This shows a BST that allows duplicates(b) - you can see that to find a value, you start at the root node and go down the left or right subtree depending on whether your search value is less than or greater than the node value.
This can be done recursively with something like:
def hasVal(node, srchval):
    if node is None:
        return False
    if node.val == srchval:
        return True
    if node.val > srchval:
        return hasVal(node.left, srchval)
    return hasVal(node.right, srchval)
and calling it with:
foundIt = hasVal (rootNode, valToLookFor)
Duplicates add a little complexity, since you may need to keep searching once you've found your value, looking for other nodes with the same value. Obviously that doesn't matter for hasVal, since it doesn't matter how many there are, just whether at least one exists. It will, however, matter for things like countVal, since it needs to know how many there are.
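A countVal sketch in the same style as hasVal (a hypothetical helper, not from the original answer; it assumes duplicates may sit on either side under the "or equal to" rules above, so it keeps searching every subtree that could still hold the value):

def countVal(node, srchval):
    if node is None:
        return 0
    count = 1 if node.val == srchval else 0
    if node.val >= srchval:
        count += countVal(node.left, srchval)    # duplicates may be to the left
    if node.val <= srchval:
        count += countVal(node.right, srchval)   # duplicates may be to the right
    return count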
(a) You could actually sort them in the opposite direction should you so wish provided you adjust how you search for a specific key. A BST need only maintain some sorted order, whether that's ascending or descending (or even some weird multi-layer-sort method like all odd numbers ascending, then all even numbers descending) is not relevant.
(b) Interestingly, if your sorting key uses the entire value stored at a node (so that nodes containing the same key have no other extra information to distinguish them), there can be performance gains from adding a count to each node, rather than allowing duplicate nodes.
The main benefit is that adding or removing a duplicate will simply modify the count rather than inserting or deleting a new node (an action that may require re-balancing the tree).
So, to add an item, you first check if it already exists. If so, just increment the count and exit. If not, you need to insert a new node with a count of one then rebalance.
To remove an item, you find it then decrement the count - only if the resultant count is zero do you then remove the actual node from the tree and rebalance.
Searches are also quicker given there are fewer nodes, but that may not have a large impact.
For example, the following two trees (non-counting on the left, and counting on the right) would be equivalent (in the counting tree, i.c means c copies of item i):
       __14__                  ___22.2___
      /      \                /          \
    14        22            7.1          29.1
   /  \      /  \          /   \        /    \
  1    14  22    29      1.1   14.3  28.1    30.1
   \            /  \
    7         28    30
Removing the leaf node 22 from the left tree would involve rebalancing the resulting 22-29-28-30 subtree (since it now has a height differential of two), such as below (this is one option; there are others that also satisfy the "height differential must be zero or one" rule):
  \                 \
   22                29
    \               /  \
     29    -->    28    30
    /  \         /
  28    30     22
Doing the same operation on the right tree is a simple modification of the root node from 22.2 to 22.1 (with no rebalancing required).
In the book "Introduction to algorithms", third edition, by Cormen, Leiserson, Rivest and Stein, a binary search tree (BST) is explicitly defined as allowing duplicates. This can be seen in figure 12.1 and the following (page 287):
"The keys in a binary search tree are always stored in such a way as to satisfy the binary-search-tree property: Let x be a node in a binary search tree. If y is a node in the left subtree of x, then y:key <= x:key. If y is a node in the right subtree of x, then y:key >= x:key."
In addition, a red-black tree is then defined on page 308 as:
"A red-black tree is a binary search tree with one extra bit of storage per node: its color"
Therefore, red-black trees defined in this book support duplicates.
Any definition is valid. As long as you are consistent in your implementation (always put equal nodes to the right, always put them to the left, or never allow them) then you're fine. I think it is most common to not allow them, but it is still a BST if they are allowed and place either left or right.
I just want to add some more information to #Robert Paulson's answer.
Let's assume that a node contains both a key and data, so nodes with the same key might contain different data.
(Therefore the search must find all nodes with the same key.)
1) left <= cur < right
2) left < cur <= right
3) left <= cur <= right
4) left < cur < right, and cur contains sibling nodes with the same key
5) left < cur < right, such that no duplicate keys exist
1 & 2 work fine if the tree does not have any rotation-related functions to prevent skewness.
But these forms don't work with an AVL tree or a red-black tree, because rotation will break the principle.
And even if search() finds the node with the key, it must traverse down to the leaf nodes to find all nodes with the duplicate key,
making the time complexity for search Θ(log N).
3 will work well with any form of BST that has rotation-related functions.
But the search will take O(n), ruining the purpose of using a BST.
Say we have the tree below, built with principle 3):
        12
       /  \
     10    20
    /  \   /
   9   11 12
      /  \
    10    12
If we do search(12) on this tree, even though we find 12 at the root, we must keep searching both the left & right children to find all the duplicate keys.
This takes O(n) time, as I said.
4 is my personal favorite. Let's say sibling means a node with the same key.
We can change the above tree into the one below.
     12 - 12 - 12
    /  \
  10 - 10   20
  /  \
 9    11
Now any search will take O(log N), because we don't have to traverse children looking for duplicate keys.
And this principle also works well with AVL or RB trees.
Working on a red-black tree implementation I was getting problems validating the tree with multiple keys until I realized that with the red-black insert rotation, you have to loosen the constraint to
left <= root <= right
Since none of the documentation I was looking at allowed for duplicate keys and I didn't want to rewrite the rotation methods to account for it, I just decided to modify my nodes to allow for multiple values within the node, and no duplicate keys in the tree.
Those three things you said are all true.
Keys are unique
To the left are keys less than this one
To the right are keys greater than this one
I suppose you could reverse your tree and put the smaller keys on the right, but really the "left" and "right" concept is just that: a visual concept to help us think about a data structure which doesn't really have a left or right, so it doesn't really matter.
1.) left <= root < right
2.) left < root <= right
3.) left < root < right, such that no duplicate keys exist.
I might have to go and dig out my algorithm books, but off the top of my head (3) is the canonical form.
(1) or (2) only come about when you start to allow duplicate nodes and you put the duplicates in the tree itself (rather than keeping a list at the node).
Duplicate Keys
• What happens if there's more than one data item with the same key?
– This presents a slight problem in red-black trees.
– It's important that nodes with the same key are distributed on both sides of other nodes with the same key.
– That is, if keys arrive in the order 50, 50, 50, you want the second 50 to go to the right of the first one, and the third 50 to go to the left of the first one. Otherwise, the tree becomes unbalanced.
– This could be handled by some kind of randomizing process in the insertion algorithm. However, the search process then becomes more complicated if all items with the same key must be found.
– It's simpler to outlaw items with the same key. In this discussion we'll assume duplicates aren't allowed.
One can create a linked list for each node of the tree that contains duplicate keys, and store the data in that list.
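A possible shape for such a node (illustrative only, shown on a plain unbalanced BST): duplicates share one tree node, and their payloads live in a list hanging off that node:

class Node:
    def __init__(self, key, data):
        self.key = key
        self.values = [data]          # every payload stored under this key
        self.left = None
        self.right = None

def insert(node, key, data):
    if node is None:
        return Node(key, data)
    if key == node.key:
        node.values.append(data)      # duplicate key: tree shape is unchanged
    elif key < node.key:
        node.left = insert(node.left, key, data)
    else:
        node.right = insert(node.right, key, data)
    return node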
The element ordering relation <= is a total order, so the relation must be reflexive, but commonly a binary search tree (aka BST) is a tree without duplicates.
Otherwise, if there are duplicates, you need to run the same deletion function two or more times!