The question is pretty simple:
I have a (potentially very unbalanced) tree.
At every iteration, new children are appended to some node.
However, children with values duplicated in their ancestors must be filtered out.
Is there a (hopefully simple) way to maintain this data structure efficiently?
The obvious ways require O(depth(node)) time per append, which I'm trying to avoid.
Use an AVL tree or another binary search tree (BST). A small piece of logic keeps duplicates out: compare with the strict operators < and > only when building the tree, never with <= or >=. Pseudocode:
    if (new_node_value < present_node_value)
        insert_in_left_side
    else if (new_node_value > present_node_value)
        insert_in_right_side
    else  // neither smaller nor larger: duplicate entry
        print "Duplicate Entry"
        return
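A minimal Python sketch of that strict-comparison insert, assuming integer keys and a plain (unbalanced) BST; the AVL rebalancing step is left out, and `Node` and `insert` are names made up for this example:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    value: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(root, value):
    """Insert value into the BST rooted at root.

    Returns (root, inserted); inserted is False when value was
    already present, in which case the tree is left unchanged.
    """
    if root is None:
        return Node(value), True
    node = root
    while True:
        if value < node.value:
            if node.left is None:
                node.left = Node(value)
                return root, True
            node = node.left
        elif value > node.value:
            if node.right is None:
                node.right = Node(value)
                return root, True
            node = node.right
        else:
            # Neither smaller nor larger: duplicate entry.
            return root, False
```

Calling `insert` with 5, 3, 8 and then 3 again would report the last call as a duplicate and leave the tree with three nodes.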
I don't understand how binary search trees are always defined as "sorted". I get that in an array representation of a binary heap you have a fully sorted array. I haven't seen array representations of binary search trees, so it is hard for me to see them as sorted like an array, e.g. [0,1,2,3,4,5], rather than sorted with respect to each node. What is the right way to think about a BST being "sorted" conceptually?
There are many types of binary search trees. All of them have one thing in common: they satisfy an invariant which enables binary search, namely an order relation by which every element in the tree can be compared to any other element in the tree, in a total preorder.
What does that mean?
Let's consider the typical statement of a BST invariant in a textbook, which states that every node's key is greater than all keys in its left sub-tree, and less than all keys in its right sub-tree. We omit conflict resolution details for keys which compare equal.
What does that BST look like? Here's an example:
The way I would explain it to a class of three-year-olds is to collapse all the nodes to the bottom level of the leaves: just let them fall down. Or, for high-schoolers, draw a line from each node/key projecting it onto the x-axis. Once you have done that, it's obvious the keys are already in (ascending) order.
Is this imaginary and casual observation analogous to our definition of a sorted sequence? Yes, it is. Since the elements of the BST satisfy a total preorder, an in-order traversal of the BST must produce those elements in order (Ex: prove it).
It is equivalent to state that if we had stored a BST's keys, by an in-order traversal, in an array, the array would be sorted.
Therefore, by way of our initial definition of a BST, the in-order traversal is the intuitive way of thinking of one as "sorted".
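To make the in-order view concrete, here is a small Python sketch; the `Node` class and the sample keys are assumptions chosen for illustration, not taken from the question:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    key: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def inorder(node):
    """Yield keys: left subtree first, then the node, then the right subtree."""
    if node is not None:
        yield from inorder(node.left)
        yield node.key
        yield from inorder(node.right)


# A hand-built BST: every key is greater than all keys in its left
# subtree and less than all keys in its right subtree.
tree = Node(8,
            Node(3, Node(1), Node(6, Node(4), Node(7))),
            Node(10, None, Node(14, Node(13))))

keys = list(inorder(tree))
```

Storing the traversal in a list gives `[1, 3, 4, 6, 7, 8, 10, 13, 14]`, which is exactly the sorted array the tree "contains".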
Does this help? It's a binary heap shown as an array
As far as data structures are concerned (arrays, trees, linked lists, etc.), "sorted" means that when you go through all its elements sequentially, you'll find that their values are ordered according to some rule (>, <, <=, etc.).
For arrays, this is easy to picture because an array is a linear data structure.
Trees are not linear. However, when iterating through a BST you will notice that all the elements are ordered according to the rule left value <= node value < right value (or something similar), which is the very definition of a sorted data structure.
It is not "sorted" in the same sense an array might be sorted (and trees, except for heaps, are rarely represented as arrays anyway), but they have a structure that allows you to easily traverse the elements in sorted order: simply traverse the nodes of the BST with a depth-first search, and output each node's value after you've looked at its left child (if any) but before you look at its right child (if any).
By the way, the array in which a heap is stored is almost always not sorted. The heap itself can also not be said to be "sorted", because it does not have enough structure to be able to readily produce the elements in sorted order without destroying the heap by successively removing the top element. For example, although you do know that the top element is larger than both of its children (or smaller, depending on the heap type), you cannot tell in advance which child is smaller than the other.
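A short Python sketch of that point, using a hand-picked min-heap array (the values are an assumption for illustration) and the standard `heapq` module:

```python
import heapq


def is_min_heap(a):
    """Check the heap property: every parent is <= each of its children."""
    return all(a[(i - 1) // 2] <= a[i] for i in range(1, len(a)))


# A valid min-heap whose backing array is not sorted.
heap = [1, 5, 2, 7, 6, 3]
heap_ok = is_min_heap(heap)
array_is_sorted = heap == sorted(heap)

# The only way to get sorted output is to keep removing the top
# element, which consumes (destroys) the heap. We drain a copy here.
work = list(heap)
drained = [heapq.heappop(work) for _ in range(len(work))]
```

Here `heap_ok` is true while `array_is_sorted` is false, and `drained` comes out in ascending order only because each `heappop` restructures what is left of the heap.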
I was asked this question in an interview. Given are two BSTs (binary search trees). We need to traverse the two trees in such a way that the result is a merged, sorted output. The constraint is that we cannot use extra memory such as arrays. I suggested a combined in-order traversal of both trees. The approach was correct, but I got stuck in the recursion and was not able to write the code.
Note: We can't merge the two trees into one.
Please someone guide me in this direction.
Thanks in advance.
I am assuming that there are no links to parent or next nodes in the tree, because otherwise this would be quite easy: you just iterate over your trees by following these links and write your merge algorithm as you would for linked lists.
If you don't have next or parent links, you cannot write a simple recursion. You'll need two "recursion" stacks.
You can implement the following structure, which allows you to iterate over each of the trees separately.
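#include <stack>

struct Node { int value; Node *left, *right; };

// In-order iterator over a BST. The stack holds the path from the
// root to the current node, so it uses O(height) space.
class Iterator {
    std::stack<Node*> st;
    // Push n, then go as far left as possible.
    void descend(Node* n) { for (; n != nullptr; n = n->left) st.push(n); }
public:
    explicit Iterator(Node* root) { descend(root); }
    bool done() const { return st.empty(); }
    int item() const { return st.top()->value; }
    void advance() {
        Node* cur = st.top();
        if (cur->right != nullptr) {
            descend(cur->right);  // successor: leftmost node of the right subtree
        } else {
            // Pop until we come up out of a left subtree. Comparing child
            // pointers instead of values stays correct when keys repeat.
            st.pop();
            while (!st.empty() && st.top()->right == cur) {
                cur = st.top();
                st.pop();
            }
        }
    }
};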
Then write your merge algorithm using two of these iterators.
You will need O(log n) space for balanced trees (O(height) in general), but asymptotically this isn't more than any recursive traversal would use.
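As a rough sketch of the merge step (in Python rather than the C++-style code above; `Node`, `inorder` and `merge_trees` are names made up for this example), each tree gets its own explicit stack and a standard two-way merge consumes the two iterators:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    value: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def inorder(root):
    """Yield values in ascending order using an explicit stack, no recursion."""
    stack = []
    node = root
    while stack or node is not None:
        while node is not None:      # go as far left as possible
            stack.append(node)
            node = node.left
        node = stack.pop()
        yield node.value
        node = node.right


_END = object()  # sentinel marking an exhausted iterator


def merge_trees(a, b):
    """Yield the values of both BSTs in one sorted sequence, keeping duplicates."""
    ia, ib = inorder(a), inorder(b)
    x, y = next(ia, _END), next(ib, _END)
    while x is not _END or y is not _END:
        if y is _END or (x is not _END and x <= y):
            yield x
            x = next(ia, _END)
        else:
            yield y
            y = next(ib, _END)
```

Only the two stacks and one pending value per tree are alive at any time, so the extra space stays proportional to the heights of the trees.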
The "simplest" way would be to:
Convert tree A to a doubly linked list (sorted)
Convert tree B to a doubly linked list (sorted)
Traverse the sorted lists printing minimum (easy)
Convert list A to tree A
Convert list B to tree B
You can find algorithms for these steps online.
I don't think doing a parallel traversal of trees is possible. You would need additional information e.g. a visited flag to eliminate left subtree as visited and even then you would run into other problems.
If anyone knows how this would be possible with a parallel traversal I would be happy to know it.
print $ merge (inorder treeA) (inorder treeB)
what's the problem?
(notice, the above is actual Haskell code which actually runs and performs the task). inorder is trivial to implement with recursion. merge is a nearly-standard feature, merging its two argument ordered (non-decreasing) lists, producing an ordered output list, keeping the duplicates.
Because of lazy evaluation and garbage collection, the lists are not actually created - at most one produced element is retained for each tree, and is discarded when the next one is produced, in effect creating iterators for the traversals (each with its own internal state).
Here's a solution for when your language does not support the above, or an equivalent yield mechanism, or explicit continuations as in Scheme, which allow switching between two contexts that are each deep inside their own control stack (thus making it possible to have "two recursions" in parallel, as in the above):
They don't say anything about time complexity, so we can do a recursive traversal of 1st tree, and traverse the 2nd tree anew, for each node of the 1st tree - while saving previous value on 1st. So, we have two consecutive values on 1st tree, and print all values from 2nd tree between them, with fresh recursive traversal, restarting from the top of the 2nd tree for each new pair of values from the 1st tree.
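A hedged Python sketch of this restart-from-the-top idea; `Node`, `emit_range`, `merge_by_restart` and the `emit` callback are assumptions made up for this example, and the running time is roughly the size of the first tree times the height of the second plus the output size:

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    value: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def emit_range(node, lo, hi, emit):
    """In-order visit of the values v in this subtree with lo < v <= hi."""
    if node is None:
        return
    if node.value > lo:               # left subtree may still hold values > lo
        emit_range(node.left, lo, hi, emit)
    if lo < node.value <= hi:
        emit(node.value)
    if node.value <= hi:              # right subtree may still hold values <= hi
        emit_range(node.right, lo, hi, emit)


def merge_by_restart(first, second, emit):
    """Emit all values of both trees in sorted order.

    Walks `first` recursively; before each of its values, traverses
    `second` afresh from its root to emit the values that fall between
    the previous and the current value of `first`.
    """
    prev = -math.inf

    def walk(node):
        nonlocal prev
        if node is None:
            return
        walk(node.left)
        emit_range(second, prev, node.value, emit)
        emit(node.value)
        prev = node.value
        walk(node.right)

    walk(first)
    emit_range(second, prev, math.inf, emit)   # whatever is left in `second`
```

Passing `print` as `emit` prints the merged sequence; passing a list's `append` collects it for inspection.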
Apparently you could do either, but the former is more common.
Why would you choose the latter and how does it work?
I read this: http://www.drdobbs.com/cpp/stls-red-black-trees/184410531; which made me think that they did it. It says:
insert_always is a status variable that tells rb_tree whether multiple instances of the same key value are allowed. This variable is set by the constructor and is used by the STL to distinguish between set and multiset and between map and multimap. set and map can only have one occurrence of a particular key, whereas multiset and multimap can have multiple occurrences.
Although now I think it doesn't necessarily mean that. They might still be using containers.
I'm thinking all the nodes with the same key would have to be in a row, because you have to store all nodes with the same key either on the right side or on the left side. So if you store equal nodes to the right and insert 1000 1s and one 2, you'd basically have a linked list, which would ruin the properties of the red-black tree.
Is the reason I can't find much on it that it's just a bad idea?
Downsides of storing duplicates as multiple nodes:
It increases the size of the tree, which makes searches slower.
If you want to retrieve all values for key K, you need M*log(N) time, where N is the total number of nodes and M is the number of values for key K, unless you add extra code (which complicates the data structure) to link these values in a list. (If each node stores a collection, finding it takes only log(N) time, and it's simple to implement.)
Deletion is more costly. With the multi-node method you need to remove a node on every delete, but with collection storage you only need to remove node K when the last value of key K is deleted.
I can't think of any upside of the multi-node method.
Binary search trees are often defined so that they cannot contain duplicates. If you use one to produce a sorted list, throwing out the duplicates would produce an incorrect result.
I am working on an implementation of Red Black trees in PHP when I ran into the duplicate issue. We are going to use the tree for sorting and searching.
I am considering adding an occurrence count to the node data type. When a duplicate is encountered, just increment the count. When walking the tree to produce output, just repeat the value as many times as the count says. I think I would still have a valid BST and avoid having a whole chain of duplicate values, which preserves the optimal search time.
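The occurrence-count idea might look roughly like this in Python (the post mentions PHP; Python is used here only for the sketch). This is a plain BST without the red-black rebalancing, and `Node`, `insert` and `sorted_values` are hypothetical names:

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.count = 1        # number of times this key was inserted
        self.left = None
        self.right = None


def insert(root, key):
    """Insert key; a duplicate only bumps the counter of the existing node."""
    if root is None:
        return Node(key)
    node = root
    while True:
        if key == node.key:
            node.count += 1
            return root
        side = "left" if key < node.key else "right"
        child = getattr(node, side)
        if child is None:
            setattr(node, side, Node(key))
            return root
        node = child


def sorted_values(node):
    """In-order walk that repeats each key `count` times."""
    if node is not None:
        yield from sorted_values(node.left)
        for _ in range(node.count):
            yield node.key
        yield from sorted_values(node.right)


def node_count(node):
    """Number of nodes actually stored in the tree."""
    return 0 if node is None else 1 + node_count(node.left) + node_count(node.right)
```

Inserting 2, 1, 1, 1, 2, 3 leaves only three nodes in the tree, yet the walk still yields all six values in order.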
I have a set of items that are supposed to form a balanced binary tree. Each item is of the form (data,parent), data being the useful information and parent being the index of the parent node in the binary tree.
Nodes in the tree are numbered left-to-right, row-by-row, like this:
         1
     ___/ \___
    /         \
   2           3
 _/\_        _/\_
4    5      6    7
These elements come stored in a linked list. How should I order this list such that it's easier for me to build the tree? Each parent node will be referenced (by index) by exactly two child nodes; if I sort these by parent index, the sorting must be stable.
You can sort the list with any stable sort, by the parent field, in increasing order.
The result will be a list like this:
[(d_1,nil), (d_2,1), (d_3,1), (d_4,2), (d_5,2), ..., (d_i,x), (d_i+1,x)]
       ^
       the root has no parent...
Note that in this list, since we used a stable sort, for every two consecutive pairs (d_i,x), (d_i+1,x) that share the parent x, d_i is the left child!
Now you can populate the tree in breadth-first order.
Since it is homework, I still want you to make sure you understand everything on your own, so I do not want to spoon-feed the answer. If you have any specific question, please comment, and I will try to edit and explain the relevant parts in more detail.
Bonus: The result of this organization is a very common way to implement a binary heap, which is a complete binary tree; for performance, we usually store it as an array, which is very similar to the output generated by this approach.
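For readers who want to check their own attempt, the stable-sort approach could be sketched in Python as follows. It assumes that parent indices are 1-based, that the root's parent is `None`, and that the input lists each left child before its sibling; `Node` and `build_tree` are made-up names:

```python
class Node:
    def __init__(self, data):
        self.data = data
        self.left = None
        self.right = None


def build_tree(items):
    """Build a tree from (data, parent) pairs and return its root.

    After a stable sort by parent, position k (0-based) in the list is
    node k+1 in the row-by-row numbering, so a parent index can be
    used directly to look up the parent node.
    """
    # Python's sorted() is stable; the root (parent None) sorts first.
    ordered = sorted(items, key=lambda it: 0 if it[1] is None else it[1])
    nodes = [Node(data) for data, _ in ordered]
    for node, (_, parent) in zip(nodes[1:], ordered[1:]):
        p = nodes[parent - 1]
        if p.left is None:      # first child seen for this parent is the left one
            p.left = node
        else:
            p.right = node
    return nodes[0] if nodes else None
```

For example, the shuffled input `[("d", 2), ("b", 1), ("a", None), ("f", 3), ("c", 1), ("e", 2), ("g", 3)]` produces a root `a` with children `b` and `c`.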
I don't think I understand what exactly are you trying to achieve. You have to write the function that inserts items in the tree. The red-black tree, for example, has the same complexity for insertions, O(log n), no matter how the input data is sorted. Is there a specific implementation that you have to use or a specific speed target that you must reach for inserts?
PS: Sounds like a homework to me :)
It sounds like you want a binary tree that allows you to go from a leaf node to its ancestors, using an array.
Inserting an already sorted list into a plain binary search tree usually produces an unbalanced tree, unless you use a treap or another self-balancing structure with O(log n) operations.
The usual way of stashing a (complete) binary tree in an array, is to make node i have two children 2i and 2i+1.
Given this organization (not sorting but organization), you can go to a parent node from a leaf node by dividing the array index by 2 using integer arithmetic which will truncate fractions.
If your binary trees are not always complete, you'll probably be better served by forgetting about using an array and instead using a more traditional tree structure with pointers/references.
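The index arithmetic for a complete binary tree stored in an array can be sketched like this in Python, assuming 1-based node numbers with slot 0 left unused; the helper names are made up for this example:

```python
def children(i):
    """Indices of the two children of node i (1-based numbering)."""
    return 2 * i, 2 * i + 1


def parent(i):
    """Index of the parent of node i; integer division truncates the fraction."""
    return i // 2


def path_to_root(i):
    """Indices visited when walking from node i up to the root (node 1)."""
    path = [i]
    while i > 1:
        i = parent(i)
        path.append(i)
    return path


# Slot 0 is unused so that node i lives at tree[i].
tree = [None, "a", "b", "c", "d", "e", "f", "g"]
ancestors_of_e = [tree[i] for i in path_to_root(5)]
```

Walking up from node 5 visits indices 5, 2 and 1, so `ancestors_of_e` is `["e", "b", "a"]`.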
How would one design a memory-efficient system that accepts Items added to it and allows Items to be retrieved given a time interval (i.e. return the Items inserted between time T1 and time T2)? There is no DB involved; Items are stored in memory. What data structure and associated algorithm are involved?
Updated:
Assume extremely high insertion rate compared to data query.
You can use a sorted data structure where the key is the time of arrival. Note the following:
items are not removed
items are inserted in order [if item i was inserted after item j then key(i)>key(j)].
For this reason, a tree is discouraged: it is overkill, and insertion into it is O(log n), whereas you can get O(1) insertion. I suggest using one of the following:
(1) Array: the array is always filled at its end. When the allocated array is full, allocate a bigger [double-sized] array and copy the existing array into it.
Advantages: good caching is usually expected with arrays, O(1) amortized insertion, used space is at most 2*elementSize*#elements
Disadvantages: high latency: when the array is full, it takes O(n) to add an element, so you need to expect a costly operation once in a while.
(2) Skip list: the skip list also gives you O(log n) seek and O(1) insertion at the end, but it doesn't have the latency issue. However, it will suffer more from cache misses than an array. Space used is on average elementSize*#elements + pointerSize*#elements*2 for a skip list.
Advantages: O(1) insertion, no costly ops.
Disadvantages: bad caching is expected.
Suggestion:
I suggest using an array if latency is not an issue. If it is, use a skip list.
In both cases, finding the desired interval works like this:
findInterval(T1,T2):
    start <- data.find(T1)   // first element with timestamp >= T1
    end <- data.find(T2)     // last element with timestamp <= T2
    for each element in data from start to end:
        yield element
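For the array variant, a minimal Python sketch could look like the following. It assumes timestamps arrive in non-decreasing order and relies on Python lists, which already grow with amortized O(1) appends; the class `TimeLog` and its method names are made up for this example:

```python
from bisect import bisect_left, bisect_right


class TimeLog:
    """Append-only store with range queries by timestamp."""

    def __init__(self):
        self.stamps = []   # timestamps, non-decreasing by assumption
        self.items = []    # items[i] was inserted at stamps[i]

    def add(self, timestamp, item):
        # Assumption: timestamp >= the last timestamp already stored.
        self.stamps.append(timestamp)
        self.items.append(item)

    def find_interval(self, t1, t2):
        """Return the items with t1 <= timestamp <= t2."""
        start = bisect_left(self.stamps, t1)    # first index with stamp >= t1
        end = bisect_right(self.stamps, t2)     # one past the last stamp <= t2
        return self.items[start:end]
```

Each query costs two binary searches plus the size of the result, while every insert is a plain append.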
Either BTree or Binary Search Tree could be a good in-memory data structure to accomplish the above. Just save the timestamp in each node and you can do a range query.
You can add them all to a simple array and sort them.
Do a binary search to locate both T1 and T2. All the array elements between them are what you are looking for.
This is helpful if the searching is done only after all the elements are added. If not, you can use an AVL or red-black tree.
How about a relation interval tree (encode your items as intervals containing only a single element, e.g., [a,a])? Although, it has been said already that the ratio of the anticipated operations matter (a lot actually). But here's my two cents:
I suppose an item X that is inserted at time t(X) is associated with that timestamp, right? Meaning you don't insert an item now which has a timestamp from a week ago or something. If that's the case go for the simple array and do interpolation search or something similar (your items will already be sorted according to the attribute that your query refers to, i.e., the time t(X)).
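A small Python sketch of interpolation search over a sorted array of timestamps, assuming integer timestamps; it returns the index of an exact match or -1, and the function name is made up for this example:

```python
def interpolation_search(keys, target):
    """Return an index i with keys[i] == target, or -1 if there is none.

    keys must be sorted in non-decreasing order. Instead of probing the
    middle, the probe position is estimated from where target lies
    between keys[lo] and keys[hi].
    """
    lo, hi = 0, len(keys) - 1
    while lo <= hi and keys[lo] <= target <= keys[hi]:
        if keys[hi] == keys[lo]:
            # All remaining keys are equal, and target lies between them.
            return lo
        pos = lo + (target - keys[lo]) * (hi - lo) // (keys[hi] - keys[lo])
        if keys[pos] == target:
            return pos
        if keys[pos] < target:
            lo = pos + 1
        else:
            hi = pos - 1
    return -1
```

When timestamps are spread fairly evenly, the estimate tends to land close to the target, which is why it can beat plain binary search on this kind of data; on heavily skewed data it can degrade to a linear number of probes.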
We already have an answer that suggests trees, but I think we need to be more specific: the only situation in which this is really a good solution is if you are very careful about how you build up the tree (and then I would say it's on par with the skip lists suggested in a different answer). The objective is to keep the tree as full as possible to the left; I'll make clearer what that means in the following. Make sure each node has a pointer to its (up to) two children and to its parent, and knows the depth of the subtree rooted at that node.
Keep a pointer to the root node so that you are able to do lookups in O(log(n)), and keep a pointer to the last inserted node N (which is necessarily the node with the highest key - its timestamp will be the highest). When you are inserting a node, check how many children N has:
If 0, then replace N with the new node you are inserting and make N its left child. (At this point you'll need to update the tree depth field of at most O(log(n)) nodes.)
If 1, then add the new node as its right child.
If 2, then things get interesting. Go up the tree from N until either you find a node that has only 1 child, or the root. If you find a node with only 1 child (this is necessarily the left child), then add the new node as its new right child. If all nodes up to the root have two children, then the current tree is full. Add the new node as the new root node and the old root node as its left child. Don't change the old tree structure otherwise.
Addendum: in order to make cache behaviour and memory overhead better, the best solution is probably to make a tree or skip list of arrays. Instead of every node having a single time stamp and a single value, make every node have an array of, say, 1024 time stamps and values. When an array fills up you add a new one in the top level data structure, but in most steps you just add a single element to the end of the "current array". This wouldn't affect big-O behaviour with respect to either memory or time, but it would reduce the overhead by a factor of 1024, while latency is still very small.
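The chunking idea from the addendum could be sketched in Python roughly as below. For brevity the top-level structure is a plain list of chunks searched with `bisect` rather than a tree or skip list, timestamps are assumed to arrive in non-decreasing order, and `ChunkedLog` with its methods is a made-up name:

```python
from bisect import bisect_left, bisect_right


class ChunkedLog:
    """Append-only log stored as fixed-capacity chunks of (timestamp, item)."""

    def __init__(self, chunk_size=1024):
        self.chunk_size = chunk_size
        self.chunks = []     # each chunk is a pair (stamps, items) of lists
        self.first_ts = []   # first timestamp of each chunk, for the top-level search

    def append(self, ts, item):
        # Start a new chunk only when the current one is full.
        if not self.chunks or len(self.chunks[-1][0]) == self.chunk_size:
            self.chunks.append(([], []))
            self.first_ts.append(ts)
        stamps, items = self.chunks[-1]
        stamps.append(ts)
        items.append(item)

    def query(self, t1, t2):
        """Yield the items with t1 <= timestamp <= t2, in insertion order."""
        # The chunk just before the first one starting at or after t1
        # may still contain timestamps >= t1.
        start = max(bisect_left(self.first_ts, t1) - 1, 0)
        for i in range(start, len(self.chunks)):
            stamps, items = self.chunks[i]
            if stamps[0] > t2:
                break
            lo = bisect_left(stamps, t1)
            hi = bisect_right(stamps, t2)
            yield from items[lo:hi]
```

Most appends touch only the last chunk, and the top-level index has one entry per chunk instead of one per item.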