Removing multiple items from balancing binary tree at once - algorithm

I'm using a red-black binary tree with linked leafs on a project (Java's TreeMap), to quickly find and iterate through the items. The problem is that I can easily get 35000 items or so on the tree, and several times I have to remove "all items above X", which can be almost the entire tree (like 30000 items at once, because all of them are bigger than X), and that takes too much time to remove them and rebalance the tree each time.
Is there any algorithm that can help me here (so I can make my own tree implementation)?

You're looking for the split operation on a red/black tree, which takes the red/black tree and some value k and splits it into two red/black trees, one with all keys greater than or equal to k and one with all keys less than k. This can be implemented in O(log n) time if you augment the structure to store some extra information. In your case, since you're using Java, you can just split the tree and discard the root of the tree you don't care about so that the garbage collector can handle it.
Details on how to implement this are given in this paper, starting on page 9. It's implemented in terms of a catenate (or join) operations which combines two trees, but I think the exposition is pretty clear.
Hope this helps!

Related

What search tree should I use when I know all the probabilities of accessing each element

In case if I don't know the probabilities of accessing each element, but I'm sure that some elements will be accessed far more often then the others, I will use Splay tree. What should I use if I already know all the probabilities? I assume that there should be some data structure that is better than splay trees for this case.
I'm trying to imagine all the cases where and when should I use every type of the search trees. Maybe someone can post some links to articles about comparison of all the search trees, and similar structures?
EDIT I'd like to still have O(log n) as the worst case, but in avarage it should be faster. Splay trees are good example, but I'd like to predefine the configuration of this tree.
For example, I have an array of elements to store [a1, a2, .. an], and the probabilities for each element [p1, p2, .. pn], which define how often I will access each element. I can create splay tree, add each element to the splay tree (O(n log n)), and then access them with given probabilities to make the desired tree. So if I have probabilities [1/2, 1/4, 1/4], I need to splay the first element, to make it be among the first. So, I need to order elements by probabilities, and splay them in the order from the lowest to the highest access probability. That takes O(n log n) also. So, overall time of building such tree is O(n log n) with a big constant. My goal is to lower this number.
I do not mind using something else, but not search tree, but I'd like for the time to be lower then in case of Splay tree. And I want search, insert and delete be in the range of O(log n) amortized.
Edit: I didn't see that you wanted to update the tree dynamically - the below algorithm requires all elements and probabilities to be known in advance. I'll leave the post up in case someone in such a situation comes along.
If you happen to be in possession of the third edition of Introduction to Algorithms by Cormen et al., it describes a dynamic programming algorithm for creating optimal binary search trees when you know all of the probabilities.
Here is a rough outline of the algorithm: First, sort the elements (on element value, not probability). We don't yet know which element should be the root of the tree, but we know that all elements that will be to the left of the root in the tree will be to the left of that element in the list, and vice versa for the elements to the right of the root. If we choose the element at index k to be the root, we get two subproblems: how to construct an optimal tree for the elements 0 through k-1, and for the elements k+1 through n-1. Solve these problems recursively, so that you know the expected cost for a search in a tree where the root is element k. Do this for all possible choices of k, and you will find which tree is the best one. Use dynamic programming or memoization in order to save computation time.
Use a hash table.
You never mentioned needing ordered iteration, and by sacrificing this you can achieve amortized O(1) insert/access complexity, better than O(log n).
Specifically, use a hash table with linked list buckets, and use the move-to-front optimization. What this means is each time you search a bucket (linked list) with more than one item, you move the item found to the front of that bucket. The next time you access this element, it will already be at the front.
If you know the access probabilities, you can further refine the technique. When inserting a new element into a bucket, don't insert it onto the front, but rather insert such that you maintain most-probable-first order. Note the move-to-front technique will tend to perform this sort implicitly already, but you can help it bootstrap more quickly.
If your tree is not going to change once created, you probably should use a hash table or tango tree:
http://en.wikipedia.org/wiki/Tango_tree
Hash tables, when not overloaded, are O(1) lookup, degrading to a O(n) when overloaded.
Tango trees, once constructed, are O(loglogn) lookup. They do not support deletion or insertion.
There's also something known as a "perfect hash" that might be good for your use.

Create a balanced binary search tree from a stream of integers

I have just finished a job interview and I was struggling with this question, which seems to me as a very hard question for giving on a 15 minutes interview.
The question was:
Write a function, which given a stream of integers (unordered), builds a balanced search tree.
Now, you can't wait for the input to end (it's a stream), so you need to balance the tree on the fly.
My first answer was to use a Red-Black tree, which of course does the job, but i have to assume they didn't expect me to implement a red black tree in 15 minutes.
So, is there any simple solution for this problem i'm not aware of?
Thanks,
Dave
I personally think that the best way to do this would be to go for a randomized binary search tree like a treap. This doesn't absolutely guarantee that the tree will be balanced, but with high probability the tree will have a good balance factor. A treap works by augmenting each element of the tree with a uniformly random number, then ensuring that the tree is a binary search tree with respect to the keys and a heap with respect to the uniform random values. Insertion into a treap is extremely easy:
Pick a random number to assign to the newly-added element.
Insert the element into the BST using standard BST insertion.
While the newly-inserted element's key is greater than the key of its parent, perform a tree rotation to bring the new element above its parent.
That last step is the only really hard one, but if you had some time to work it out on a whiteboard I'm pretty sure that you could implement this on-the-fly in an interview.
Another option that might work would be to use a splay tree. It's another type of fast BST that can be implemented assuming you have a standard BST insert function and the ability to do tree rotations. Importantly, splay trees are extremely fast in practice, and it's known that they are (to within a constant factor) at least as good as any other static binary search tree.
Depending on what's meant by "search tree," you could also consider storing the integers in some structure optimized for lookup of integers. For example, you could use a bitwise trie to store the integers, which supports lookup in time proportional to the number of bits in a machine word. This can be implemented quite nicely using a recursive function to look over the bits, and doesn't require any sort of rotations. If you needed to blast out an implementation in fifteen minutes, and if the interviewer allows you to deviate from the standard binary search trees, then this might be a great solution.
Hope this helps!
AA Trees are a bit simpler than Red-Black trees, but I couldn't implement one off the top of my head.
One of the simplest balanced binary search tree is BB(α)-tree. You pick the constant α, which says how much unbalanced can the tree get. At all times, #descendants(child) <= (1-α) × #descendants(node) must hold. You treat it as normal binary search tree, but when the formula doesn't apply to some node anymore, you just rebuild that part of the tree from scratch, so that it is perfectly balanced.
The amortized time complexity for insertion or deletion is still O(log N), just as with other balanced binary trees.

What class of problem would one use a binary search tree to solve?

I've seen this data structure talked about a lot, but I am unclear as to what sort of problem would demand such a data structure (over alternative representations). I've never needed one, but perhaps that's because I don't quite grok it. Can you enlighten me?
One example of where you would use a binary search tree would be a sorted list of values where you want to be able to quickly add elements.
Consider using an array for this purpose. You have very fast access to read random values, but if you want to add a new value, you have to find the place in the array where it belongs, shift everything over, and then insert the new value.
With a binary search tree, you simply traverse the tree looking for where the value would be if it were in the tree already, and then add it there.
Also, consider if you want to find out if your sorted array contains a particular value. You have to start at one end of the array and compare the value you're looking for to each individual value until you either find the value in the array, or pass the point where it would have been. With a binary search tree, you greatly reduce the number of comparisons you are likely to have to make. Just a quick caveat, however, it is definitely possible to contrive situations where the binary search tree requires more comparisons, but these are the exception, not the rule.
One thing I've used it for in the past is Huffman decoding (or any variable-bit-length scheme).
If you maintain your binary tree with the characters at the leaves, each incoming bit decides whether you move to the left or right node.
When you reach a leaf node, you have your decoded character and you can start on the next one.
For example, consider the following tree:
.
/ \
. C
/ \
A B
This would be a tree for a file where the predominant letter was C (by having less bits used for common letters, the file is shorter than it would be for a fixed-bit-length scheme). The codes for the individual letters are:
A: 00 (left, left).
B: 01 (left, right).
C: 1 (right).
The class of problems you use then for are those where you want to be able to both insert and access elements reasonably efficiently. As well as unbalanced trees (such as the Huffman example above), you can also use balanced trees which make the insertions a little more costly (since you may have to rebalance on the fly) but make lookups a lot more efficient since you're traversing the minimum possible number of nodes.
from wiki
Self-balancing binary search trees can be used in a natural way to construct and maintain ordered lists, such as priority queues. They can also be used for associative arrays; key-value pairs are simply inserted with an ordering based on the key alone. In this capacity, self-balancing BSTs have a number of advantages and disadvantages over their main competitor, hash tables. One advantage of self-balancing BSTs is that they allow fast (indeed, asymptotically optimal) enumeration of the items in key order, which hash tables do not provide. One disadvantage is that their lookup algorithms get more complicated when there may be multiple items with the same key.
Self-balancing BSTs can be used to implement any algorithm that requires mutable ordered lists, to achieve optimal worst-case asymptotic performance. For example, if binary tree sort is implemented with a self-balanced BST, we have a very simple-to-describe yet asymptotically optimal O(n log n) sorting algorithm. Similarly, many algorithms in computational geometry exploit variations on self-balancing BSTs to solve problems such as the line segment intersection problem and the point location problem efficiently. (For average-case performance, however, self-balanced BSTs may be less efficient than other solutions. Binary tree sort, in particular, is likely to be slower than mergesort or quicksort, because of the tree-balancing overhead as well as cache access patterns.)
Self-balancing BSTs are flexible data structures, in that it's easy to extend them to efficiently record additional information or perform new operations. For example, one can record the number of nodes in each subtree having a certain property, allowing one to count the number of nodes in a certain key range with that property in O(log n) time. These extensions can be used, for example, to optimize database queries or other list-processing algorithms.

Are interval, segment, fenwick trees the same?

Today i listened a lecture about fenwick trees (binary indexed trees) and the teacher says than this tree is a generalization of interval and segment trees, but my implementations of this three data structures are different.
Is this afirmation true? and Why?
The following classification seems sensible although different people are bound to mix these terms up.
Fenwick tree/Binary-indexed tree link
The one where you use a single array and operations on the binary representation to store prefix sums (also called cumulative sums). Elements can be members of a monoid.
Range tree link
The family of trees where each node represents a subrange of a given range, say [0, N]. Used to compute associative operations on intervals.
Interval tree link
Trees where you store actual intervals. Most commonly you take a midpoint, keep the intersecting intervals at the node and repeat the process for the intervals to the left and to the right of the point.
Segment tree link
Similar to a range tree where leaves are elementary intervals in a possibly continuous space rather than discrete and higher nodes are unions of the elementary intervals. Used to check for point inclusion.
I have never heard binary indexed trees called a generalization of anything. It's certainly not a generalization of interval trees and segment trees. I suggest you follow the links to convince yourself of this.
than this tree is a generalization of interval and segment trees
If by "this tree" your teacher meant "the binary indexed tree", then he is wrong.
but my implementations of this three data structures are different
Of course they are different, your teacher never said they shouldn't be. He just said one is a generalization of the other (which isn't true, but still). Either way, the implementations are supposed to be different.
What would have the same implementation is a binary indexed tree and a fenwick tree, because those are the same thing.

Tree Datastructures

I've tried to understand what sorted trees are and binary trees and avl and and and ...
I'm still not sure, what makes a sorted tree sorted? And what is the complexity (Big-Oh) between searching in a sorted and searching in an unsorted tree? Hope you can help me.
Binary Trees
There exists two main types of binary trees, balanced and unbalanced. A balanced tree aims to keep the height of the tree (height = the amount of nodes between the root and the furthest child) as even as possible. There are several types of algorithms for balanced trees, the two most famous being AVL- and RedBlack-trees. The complexity for insert/delete/search operations on both AVL and RedBlack trees is O(log n) or better - which is the important part. Other self balancing algorithms are AA-, Splay- and Scapegoat-tree.
Balanced trees gain their property (and name) of being balanced from the fact that after every delete or insert operation on the tree the algorithm introspects the tree to make sure it's still balanced, if it's not it will try to fix this (which is done differently with each algorithm) by rotating nodes around in the tree.
Normal (or unbalanced) binary trees do not modify their structure to keep themselves balanced and have the risk of, most often overtime, to become very inefficient (especially if the values are inserted in order). However if performance is of no issue and you mainly want a sorted data structure then they might do. The complexity for insert/delete/search operations on an unbalanced tree range from O(1) (best case - if you want the root) to O(n) (worst-case if you inserted all nodes in order and want the largest node)
There exists another variation which is called a randomized binary tree which uses some kind of randomization to make sure the tree doesn't become fully unbalanced (which is the same as a linked list)
A binary search tree is an "tree"-structure where every node has two children-nodes.
The left nodes all have the property of being less than its parent, and the right-nodes are all greater than its parent.
The intressting thing with an binary-tree is that we can search for an value in O(log n) when the tree is properly sorted. Doing the same search in an LinkedList for an example would give us the searchspeed of O(n).
The best way to go about learning datastructures would be to do a day of googling and reading wikipedia articles.
This might get you started
http://en.wikipedia.org/wiki/Binary_search_tree
Do a google search for the following:
site:stackoverflow.com binary trees
to get a list of SO questions which will answer your several questions.
There isn't really a lot of point in using a tree structure if it isn't sorted in some fashion - if you are planning on searching for a node in the tree and it is unsorted, you will have to traverse the entire tree (O(n)). If you have a tree which is sorted in some fashion, then it is only necessary to traverse down a single branch of the tree (typically O(log n)).
In binary tree the right leaf is always smaller then the head, and the left leaf is always bigger, so you can search in sorted tree in O(log(n)), you just need to go right if if the key is smaller than head and to the left if bgger

Resources