Datastructure for fast and efficient search - data-structures

I have to store the sorted data in a data structure.
The data structure I want to use is heap or binary search tree.
But I am confused which one would better serve the requirement i.e. fast and efficient searching.
----MORE DETAILS---
I am designing an application that receives data from a source (say, a data grid) and then stores it in a data structure. The data that comes from the data grid station is in the form of sorted digits. The sorted data can be in ascending or descending order.
Now I have to search the data, and the process should be efficient and fast.

A heap will only let you search quickly for the minimum element (find it in O(1) time, remove it in O(log n) time). If you design it the other way, it will let you find the maximum, but you don't get both. To search for arbitrary elements quickly (in O(log n) time), you'll want the binary search tree.

For efficient searching, one would definitely prefer a binary search tree.
To search for a value in a heap may require that you search the entire tree - you can't guarantee that some value may not appear on either the left or right subtree (unless one of the children is already greater than the target value, but this isn't guaranteed to happen).
So searching in a heap takes O(n), whereas it takes O(log n) in a (self-balancing) binary search tree.
A heap is only really preferred if you're primarily interested in finding and/or removing the minimum / maximum, along with insertions.
Either can be constructed in O(n) if you're given already-sorted data.
You mentioned a sorted data structure, but in the "more details" in your question I don't really see that a sorted data structure is required (it doesn't matter too much that that's the form in which your data is already in), but it really depends on exactly what type of queries you will do.
If you're only going to search for exact values, you don't really need a sorted data structure, and can use a hash table instead, which supports expected O(1) lookups.
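As a rough illustration of the difference, here is a small Python sketch using the standard heapq and bisect modules. A sorted list with binary search stands in for the balanced BST (Python's standard library has no tree map), and the sample values are made up for the example.

```python
import bisect
import heapq

data = [3, 8, 15, 23, 42, 57, 91]   # already sorted, as in the question

# Min-heap: the smallest element is always at index 0 (O(1) to read),
# but finding an arbitrary value means scanning the whole list (O(n)).
heap = list(data)
heapq.heapify(heap)
smallest = heap[0]
found_in_heap = 42 in heap          # linear scan over the heap's list

# Sorted list + binary search: O(log n) lookup of any value.
def contains_sorted(sorted_values, target):
    i = bisect.bisect_left(sorted_values, target)
    return i < len(sorted_values) and sorted_values[i] == target

found_in_sorted = contains_sorted(data, 42)
missing_in_sorted = contains_sorted(data, 40)
```

If exact-match lookups are all you need, a plain set or dict would replace contains_sorted here.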

Let me make a list of potential data structures and we'll elaborate:
Binary search tree - it keeps the data sorted, so adding a new element costs more than in a hash table (O(log n) if the tree is balanced). When you search through it you can use binary search, which is O(log n). It is memory efficient and it doesn't need much additional memory.
Hash table (http://en.wikipedia.org/wiki/Hash_table) - every element is stored with a Hash. You can get element by providing the hash. Your elements don't need to be sortable, they only need to provide hashing method. Accessing elements is O(1) which I suppose is pretty decent one :)
I myself usually use hashtables but it depends on what exactly you need to store and how often you add or delete elements.
Check this also: Advantages of Binary Search Trees over Hash Tables
So in my opinion, instead of a heap or a binary search tree, use a hash table.

I would go with a hash table that uses separate chaining, with an AVL tree in each bucket (I assume collisions occur). It will work better than O(log n), where n is the number of items. After getting the index from the hash function, m items will be at this index, where m is less than or equal to n. (It is usually much smaller, but never more.)
That is O(1) for hashing and O(log m) for searching in the AVL tree. This is faster than binary search over the sorted data.

Related

Why "delete" operation is considered to be "slow" on a sorted array?

I am currently studying algorithms and data structures with the help of the famous Stanford course by Tim Roughgarden. In video 13-1 when explaining Balanced Binary Search Trees he compared them to sorted arrays and mentioned that we do not do deletion on sorted array because it is too slow (I believe he meant "slow in comparison with other operations, that we can run in constant [Select, Min/Max, Pred/Succ], O(log n) [Search, Rank] and O(n) [Output/print] time").
I cannot stop thinking about this statement. Namely I cannot wrap my mind around the following:
Let's say we are given an order statistic or a value of the item we
want to delete from a sorted (ascending) array.
We can most certainly find its position in the array using Select or
Search in constant or O(log n) time respectively.
We can then remove this item and iterate over the items to the right
of the deleted one, decrementing their indices by one, which will take
O(n) time. [this is me (possibly unsuccessfully) trying to describe
the 'move each of them 1 position to the left' operation]
The whole operation will take linear time - O(n) - in the worst case
scenario.
Key question - Am I thinking in a wrong way? If not, why is it considered slow and undesirable?
You are correct: deleting from an array is slow because you have to move all elements after it one position to the left, so that you can cover the hole you created.
Whether O(n) is considered slow depends on the situation. Deleting from an array is most likely part of a larger, more complex algorithm, e.g. inside a loop. This then would add a factor of n to your final complexity, which is usually bad. Using a tree would only add a factor of log n, and O(n log n) is much better than O(n^2) (asymptotically).
The statement is relative to the specific data structure which is being used to hold the sorted values: A sorted array. This specific data structure would be selected for simplicity, for efficient storage, and for quick searches, but is slow for adding and removing elements from the data structure.
Other data structures which hold sorted values may be selected. For example, a binary tree, or a balanced binary tree, or a trie. Each has different characteristics in terms of operation performance and storage efficiency, and would be selected based on the intended usage.
A sorted array is slow for additions and removals because, on average, these operations require shifting half of the array to make room for a new element (or, respectively, to fill in an emptied cell).
However, on many architectures, the simplicity of the data structure and the speed of shifting means that the data structure is fine for "small" data sets.
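To make the cost concrete, here is a minimal Python sketch of deleting from a sorted array: the search is logarithmic, but the shift that closes the hole is linear. The function name delete_from_sorted and the sample list are illustrative only.

```python
import bisect

def delete_from_sorted(values, target):
    """Remove one occurrence of target from a sorted list; return True if removed."""
    i = bisect.bisect_left(values, target)       # O(log n) search
    if i == len(values) or values[i] != target:
        return False
    # Shift every later element one slot to the left: O(n) in the worst case.
    for j in range(i, len(values) - 1):
        values[j] = values[j + 1]
    values.pop()                                 # drop the now-duplicated last slot
    return True

numbers = [2, 4, 6, 8, 10]
removed = delete_from_sorted(numbers, 4)
```

In practice `del values[i]` performs the same shift internally; the explicit loop is only there to show where the linear work goes.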

Which node data structure to use for a trie

I am using a trie for the first time. I wanted to know which is the best data structure to use in a trie node for deciding which branch to traverse next. I was looking at an array, a hashmap and a linked list.
Each of these options has their advantages and disadvantages.
If you store the child nodes in an array, then you can look up which child to visit extremely efficiently by just indexing into the array. However, the space usage per node will be high: O(|Σ|), where Σ is the set of letters that your words can be formed from, even if most of those children are null.
If you store the child nodes in a linked list, then the time required to find a child will be O(|Σ|), since you may need to scan across all of the nodes of the linked list to find the child you want. On the other hand, the space efficiency will be quite good, because you only store the children that you're using. You could also consider using a fixed-sized array here, which has even better space usage but leads to very expensive insertions and deletions.
If you store the child nodes in a hash table, then the (expected) time to find a child will be O(1) and the memory usage will only be proportional (roughly) to the number of children you have. Interestingly, because you know in advance what values you're going to be hashing, you could consider using a dynamic perfect hash table to ensure worst-case O(1) lookups, at the expense of some precomputation.
Another option would be to store the child nodes in a binary search tree. This gives rise to the ternary search tree data structure. This choice is somewhere between the linked list and hash table options - the space usage is low and you can perform predecessor and successor queries efficiently, but there's a slight increase in the cost of performing a lookup due to the search cost in the BST. If you have a static trie where insertions never occur, you can consider using weight-balanced trees as the BSTs at each point; this gives excellent runtime for searches (O(n + log k), where n is the length of the string to search for and k is the total number of words in the trie).
In short, the array lookups are fastest but its space usage in the worst case is the worst. A statically-sized array has the best memory usage but expensive insertions and deletions. The hash table has decently fast lookups and good memory usage (on average). Binary search trees are somewhere in the middle. I would suggest using the hash table here, though if you put a premium on space and don't care about lookup times the linked list might be better. Also, if your alphabet is small (say, you're making a binary trie), the array overhead won't be too bad and you may want to use that.
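As a sketch of the hash table option, here is a small Python trie whose nodes keep their children in a dict (Python's built-in hash table). The class and method names are made up for this example.

```python
class TrieNode:
    def __init__(self):
        self.children = {}      # char -> TrieNode; only existing children are stored
        self.is_word = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            nxt = node.children.get(ch)      # expected O(1) child lookup
            if nxt is None:
                nxt = node.children[ch] = TrieNode()
            node = nxt
        node.is_word = True

    def contains(self, word):
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word

trie = Trie()
for w in ("car", "cart", "cat"):
    trie.insert(w)
```

Swapping the dict for a fixed-size list indexed by character would give the array variant, at the cost of one slot per letter of the alphabet in every node.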
Hope this helps!
If you are trying to build a trie just for alphabetic characters, I would suggest using an array, and then using a Patricia tree (a space-optimized trie).
http://en.wikipedia.org/wiki/Radix_tree
This will allow you to do fast lookups with an array without wasting too much space when the branching factor of a node is low.

Which search data structure works best for sorted integer data?

I have over a billion sorted integers. Which data structure do you think can exploit the sorted order? The main goal is to search items faster...
Options I can think of --
1) regular Binary Search trees with recursively splitting in the middle approach.
2) Any other balanced Binary search trees should work well, but does not exploit the sorted heuristics..
Thanks in advance..
[Edit]
Insertions and deletions are very rare...
Also, apart from integers I have to store some other information in the nodes. I think plain arrays can't do that unless it is a list, right?
This really depends on what operations you want to do on the data.
If you are just searching the data and never inserting or deleting anything, just storing the data in a giant sorted array may be perfectly fine. You could then use binary search to look up elements efficiently in O(log n) time. However, insertions and deletions can be expensive since with a billion integers O(n) will hurt. You could store auxiliary information inside the array itself, if you'd like, by just placing it next to each of the integers.
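A minimal Python sketch of this idea, assuming the auxiliary information is kept in a second list aligned with the sorted keys (the names keys, payloads and lookup, and the sample data, are illustrative):

```python
import bisect

# Parallel lists: keys stay sorted, payloads[i] belongs to keys[i].
keys = [10, 20, 30, 40, 50]
payloads = ["a", "b", "c", "d", "e"]

def lookup(key):
    """Return the payload stored for key, or None if key is absent."""
    i = bisect.bisect_left(keys, key)        # O(log n) binary search
    if i < len(keys) and keys[i] == key:
        return payloads[i]
    return None
```

For a billion entries you would likely want a compact array type rather than a Python list, but the search logic stays the same.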
However, with a billion integers, this may be too memory-intensive and you may want to switch to using a bit vector. You could then do a binary search over the bit vector in time O(log U), where U is the number of bits. With a billion integers, I assume that U and n would be close, so this isn't that much of a penalty. Depending on the machine word size, this could save you anywhere from 32x to 128x memory without causing too much of a performance hit. Plus, this will increase the locality of the binary searches and can improve performance as well. This does make it much slower to actually iterate over the numbers in the list, but it makes insertions and deletions take O(1) time. In order to do this, you'd need to store some secondary structure (perhaps a hash table?) containing the data associated with each of the integers. This isn't too bad, since you can use this sorted bit vector for sorted queries and the unsorted hash table once you've found what you're looking for.
If you also need to add and remove values from the list, a balanced BST can be a good option. However, because you specifically know that you're storing integers, you may want to look at the more complex van Emde Boas tree structure, which supports insertion, deletion, predecessor, successor, find-max, and find-min all in O(log log n) time, which is exponentially faster than binary search trees. The implementation cost of this approach is high, though, since the data structure is notoriously tricky to get right.
Another data structure you might want to explore is a bitwise trie, which has the same time bounds as the sorted bit vector but allows you to store auxiliary data along with each integer. Plus, it's super easy to implement!
Hope this helps!
The best data structure for searching sorted integers is an array.
You can search it with log(N) operations, and it is more compact (less memory overhead) than a tree.
And you don't even have to write any code (so less chance of a bug) -- just use bsearch from your standard library.
With a sorted array the best you can achieve is with an interpolation search, which gives you log(log(n)) average time when the keys are roughly uniformly distributed. It is essentially a binary search, but it doesn't divide the array into two subarrays of the same size.
It's really fast and extraordinarily easy to implement.
http://en.wikipedia.org/wiki/Interpolation_search
Don't let the worst-case O(n) bound scare you, because with 1 billion integers it's practically impossible to hit.
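For reference, a minimal Python sketch of interpolation search over a sorted list of integers. It assumes integer keys and is meant as an illustration, not a tuned implementation; the sample data is made up.

```python
def interpolation_search(values, target):
    """Return an index of target in the sorted list values, or -1 if absent."""
    lo, hi = 0, len(values) - 1
    while lo <= hi and values[lo] <= target <= values[hi]:
        if values[lo] == values[hi]:
            # All remaining keys are equal; avoid dividing by zero below.
            return lo if values[lo] == target else -1
        # Estimate the position assuming keys are spread roughly evenly.
        pos = lo + (target - values[lo]) * (hi - lo) // (values[hi] - values[lo])
        if values[pos] == target:
            return pos
        if values[pos] < target:
            lo = pos + 1
        else:
            hi = pos - 1
    return -1

sample = list(range(0, 1000, 7))    # 0, 7, 14, ...
```

On evenly spaced data like sample, the first position estimate is already exact or off by very little, which is where the log(log(n)) average comes from.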
O(1) solutions:
Assuming 32-bit integers and a lot of ram:
A lookup table with size 2³² roughly (4 billion elements), where each index corresponds to the number of integers with that value.
Assuming larger integers:
A really big hash table. The usual modulus hash function would be appropriate if you have a decent distribution of the values, if not, you might want to combine the 32-bit strategy with a hash lookup.

Advantages of Binary Search Trees over Hash Tables

What are the advantages of binary search trees over hash tables?
Hash tables can look up any element in Theta(1) time and it is just as easy to add an element....but I'm not sure of the advantages going the other way around.
One advantage that no one else has pointed out is that binary search tree allows you to do range searches efficiently.
To illustrate the idea, consider an extreme case. Say you want to get all the elements whose keys are between 0 and 5000, and there is actually only one such element, plus 10000 other elements whose keys are not in the range. A BST can do range searches quite efficiently, since it does not search a subtree that cannot contain the answer.
But how can you do range searches in a hash table? You either need to iterate over every bucket, which is O(n), or you have to check whether each of 1, 2, 3, 4... up to 5000 exists.
(And what if the keys between 0 and 5000 form an infinite set? For example, the keys could be decimals.)
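The extreme case above can be sketched in Python. Since the standard library has no BST, a sorted list with bisect stands in for the ordered structure here; the function names and the generated keys are assumptions made for the example.

```python
import bisect

# One key inside [0, 5000] and many outside it, as in the example above.
keys = sorted([1234] + [10_000 + i for i in range(10_000)])
table = {k: f"value-{k}" for k in keys}      # hash table view of the same data

def range_query_sorted(sorted_keys, low, high):
    """Keys in [low, high] from ordered data: O(log n + number of matches)."""
    start = bisect.bisect_left(sorted_keys, low)
    end = bisect.bisect_right(sorted_keys, high)
    return sorted_keys[start:end]

def range_query_hash(mapping, low, high):
    """Keys in [low, high] from a hash table: every key is examined, O(n)."""
    return sorted(k for k in mapping if low <= k <= high)
```

The ordered version also works unchanged with non-integer bounds such as 0.5 and 2000.25, which the "probe every integer" approach cannot handle.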
Remember that Binary Search Trees (reference-based) are memory-efficient. They do not reserve more memory than they need to.
For instance, if a hash function has a range R(h) = 0...100, then you need to allocate an array of 100 (pointers-to) elements, even if you are just hashing 20 elements. If you were to use a binary search tree to store the same information, you would only allocate as much space as you needed, as well as some metadata about links.
One "advantage" of a binary tree is that it may be traversed to list off all elements in order. This is not impossible with a hash table, but it is not a normal operation one designs into a hashed structure.
In addition to all the other good comments:
Hash tables in general have better cache behavior, requiring fewer memory reads than a binary tree. For a hash table you normally incur only a single read before you have access to a reference holding your data. The binary tree, if it is a balanced variant, requires something on the order of k * lg(n) memory reads for some constant k.
On the other hand, if an enemy knows your hash function, the enemy can force your hash table to produce collisions, greatly hampering its performance. The workaround is to choose the hash function randomly from a family, but a BST does not have this disadvantage. Also, when the hash table pressure grows too much, you often have to enlarge and reallocate the hash table, which may be an expensive operation. The BST has simpler behavior here and does not tend to suddenly allocate a lot of data and do a rehashing operation.
Trees tend to be the ultimate average data structure. They can act as lists, can easily be split for parallel operation, have fast removal, insertion and lookup on the order of O(lg n). They do nothing particularly well, but they don't have any excessively bad behavior either.
Finally, BSTs are much easier to implement in (pure) functional languages compared to hash-tables and they do not require destructive updates to be implemented (the persistence argument by Pascal above).
The main advantage of a binary tree over a hash table is that the binary tree gives you two additional operations you can't do (easily, quickly) with a hash table:
find the element closest to (not necessarily equal to) some arbitrary key value (or closest above/below)
iterate through the contents of the tree in sorted order
The two are connected -- the binary tree keeps its contents in a sorted order, so things that require that sorted order are easy to do.
A (balanced) binary search tree also has the advantage that its asymptotic complexity is actually an upper bound, while the "constant" times for hash tables are amortized times: if you have an unsuitable hash function, you could end up degrading to linear time, rather than constant.
A binary tree is slower to search and insert into, but has the very nice feature of the in-order traversal, which essentially means that you can iterate through the nodes of the tree in sorted order.
Iterating through the entries of a hash table just doesn't make a lot of sense because they are all scattered in memory.
A hashtable would take up more space when it is first created - it will have available slots for the elements that are yet to be inserted (whether or not they are ever inserted), a binary search tree will only be as big as it needs to be. Also, when a hash-table needs more room, expanding to another structure could be time-consuming, but that might depend on the implementation.
A binary search tree can be implemented with a persistent interface, where a new tree is returned but the old tree continues to exist. Implemented carefully, the old and new trees share most of their nodes. You cannot do this with a standard hash table.
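A minimal Python sketch of such a persistent interface, using path copying on an unbalanced BST (a real implementation would also rebalance; the Node type and function names are made up for this example):

```python
from typing import NamedTuple, Optional

class Node(NamedTuple):
    key: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def insert(root, key):
    """Return a new tree containing key; the tree rooted at root is left untouched."""
    if root is None:
        return Node(key)
    if key < root.key:
        return Node(root.key, insert(root.left, key), root.right)   # right subtree shared
    if key > root.key:
        return Node(root.key, root.left, insert(root.right, key))   # left subtree shared
    return root                                                     # key already present

def in_order(root):
    if root is None:
        return []
    return in_order(root.left) + [root.key] + in_order(root.right)

old = None
for k in (50, 30, 70, 20, 40):
    old = insert(old, k)
new = insert(old, 60)
```

Only the nodes on the path from the root to the insertion point are copied, so new.left is the very same object as old.left.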
BSTs also provide the "findPredecessor" and "findSuccessor" operations (to find the next smallest and next largest elements) in O(log n) time, which can be very handy. A hash table can't provide these operations with that time efficiency.
From Cracking the Coding Interview, 6th Edition
We can implement the hash table with a balanced binary search tree (BST) . This gives us an O(log n) lookup time. The advantage of this is potentially using less space, since we no longer allocate a large array. We can also iterate through the keys in order, which can be useful sometimes.
GCC C++ case study
Let's also get some insight from one of the most important implementations in the world. As we will see, it actually matches our theory perfectly!
As shown at What is the underlying data structure of a STL set in C++?, in GCC 6.4:
std::map uses BST
std::unordered_map uses hashmap
So this already points to the fact that you can't traverse a hashmap efficiently in sorted order, which is perhaps the main advantage of a BST.
And then, I also benchmarked insertion times in hash map vs BST vs heap at Heap vs Binary Search Tree (BST) which clearly highlights the key performance characteristics:
BST insertion is O(log n), hashmap is O(1). And in this particular implementation, hashmap is almost always faster than BST, even for relatively small sizes
hashmap, although much faster in general, has some extremely slow insertions visible as single points in the zoomed out plot.
These happen when the implementation decides that it is time to increase its size, and it needs to be copied over to a larger one.
In more precise terms, this is because only its amortized complexity is O(1), not the worst case, which is actually O(n) during the array copy.
This might make hashmaps inadequate for certain real-time applications, where you need stronger time guarantees.
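The gap between amortized and worst-case cost can be seen in a simplified Python model of a table that doubles its capacity whenever it is full. This is only a toy model (real hash maps resize at a load-factor threshold and also rehash), and simulate_growth and its parameters are made up for the illustration.

```python
def simulate_growth(num_inserts, initial_capacity=8):
    """Return the number of elements copied during each insert into a doubling table."""
    capacity = initial_capacity
    size = 0
    copies_per_insert = []
    for _ in range(num_inserts):
        copied = 0
        if size == capacity:        # table is full: grow and move everything over
            copied = size
            capacity *= 2
        size += 1
        copies_per_insert.append(copied)
    return copies_per_insert

costs = simulate_growth(1000)
worst = max(costs)                  # the single slowest insert
average = sum(costs) / len(costs)   # amortized copies per insert
```

In this model the average stays close to one copy per insert, while the slowest single insert copies hundreds of elements, which matches the isolated spikes described above.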
Related:
Binary Trees vs. Linked Lists vs. Hash Tables
https://cs.stackexchange.com/questions/270/hash-tables-versus-binary-trees
If you want to access the data in a sorted manner, then a sorted list has to be maintained in parallel to the hash table. A good example is Dictionary in .Net. (see http://msdn.microsoft.com/en-us/library/3fcwy8h6.aspx).
This has the side-effect of not only slowing inserts, but it consumes a larger amount of memory than a b-tree.
Further, since a b-tree is sorted, it is simple to find ranges of results, or to perform unions or merges.
It also depends on the use. A hash table lets you locate an exact match. If you want to query for a range, then a BST is the choice. Suppose you have a lot of data e1, e2, e3 ..... en.
With hash table you can locate any element in constant time.
If you want to find range values greater than e41 and less than e8, BST can quickly find that.
The key thing is the hash function used to avoid a collision. Of course, we cannot totally avoid a collision, in which case we resort to chaining or other methods. This makes retrieval no longer constant time in worst cases.
Once full, hash table has to increase its bucket size and copy over all the elements again. This is an additional cost not present over BST.
Binary search trees are good choice to implement dictionary if the keys have some total order (keys are comparable) defined on them and you want to preserve the order information.
As BST preserves the order information, it provides you with four additional dynamic set operations that cannot be performed (efficiently) using hash tables. These operations are:
Maximum
Minimum
Successor
Predecessor
All these operations, like every BST operation, have a time complexity of O(H), where H is the height of the tree. Additionally, all the stored keys remain sorted in the BST, so you can get the sorted sequence of keys just by traversing the tree in order.
In summary, if all you want is the operations insert, search and delete, then a hash table is unbeatable (most of the time) in performance. But if you want any or all of the operations listed above, you should use a BST, preferably a self-balancing BST.
A hashmap is a set associative array. So, your array of input values gets pooled into buckets. In an open addressing scheme, you have a pointer to a bucket, and each time you add a new value into a bucket, you find out where in the bucket there are free spaces. There are a few ways to do this. You can start at the beginning of the bucket, increment the pointer each time, and test whether the slot is occupied; this is called linear probing. Alternatively, you can grow the step between probes on each attempt, so that the offsets increase quadratically rather than by one; this is called quadratic probing.
OK. Now the problems in both these methods is that if the bucket overflows into the next buckets address, then you need to-
Double each buckets size- malloc(N buckets)/change the hash function-
Time required: depends on malloc implementation
Transfer/Copy each of the earlier buckets data into the new buckets data. This is an O(N) operation where N represents the whole data
OK. but if you use a linkedlist there shouldn't be such a problem right? Yes, In linked lists you don't have this problem. Considering each bucket to begin with a linked list, and if you have 100 elements in a bucket it requires you to traverse those 100 elements to reach the end of the linkedlist hence the List.add(Element E) will take time to-
Hash the element to a bucket- Normal as in all implementations
Take time to find the last element in said bucket- O(N) operation.
The advantage of the linkedlist implementation is that you don't need the memory allocation operation and O(N) transfer/copy of all buckets as in the case of the open addressing implementation.
So, the way to minimize the O(N) operation is to convert the implementation to that of a binary search tree, where find operations are O(log(N)) and you add the element in its position based on its value. The added feature of a BST is that it comes sorted!
Hash Tables are not good for indexing. When you are searching for a range, BSTs are better. That's the reason why most database indexes use B+ trees instead of Hash Tables
Binary search trees can be faster when used with string keys, especially when the strings are long.
Binary search trees use less/greater comparisons, which are fast for strings (when they are not equal), since a comparison can stop at the first differing character. So a BST can quickly answer when a string is not found.
When it's found, it will need to do only one full comparison.
In a hash table, you need to calculate the hash of the string, which means going through all of its bytes at least once. Then you go through them again to compare when a matching entry is found.

Fast algorithm for finding next smallest and largest number in a set

I have a set of positive numbers. Given a number not in the set, I want to find the next smallest and next largest numbers that are in the set. The only way I can think to do it now is to find the next smallest by decreasing by 1 until I find a number in the set, and then do the same for finding the next largest.
Motivation: I have a bunch of data in a hashmap, keyed by dates. I don't have a datapoint for every single date. If I have data for, say, 10/01/2000 as 60 and 10/05/2000 as 68, and I ask for 10/02/2000, I want to linearly interpolate. I should get 62.
It depends on if your set is sorted.
If your set is unsorted then finding the closest (higher and lower) is an O(n) operation and a fairly simple algorithm.
If your set is sorted then you can use a modified bisection search to find the answer in O(log n), which is obviously a lot better particularly on larger sets.
If you're doing this repeatedly it might be worth sorting the set, which incurs an O(n log n) cost that might be once off or not depending on how often the set changes. Some kind of tree sort may help improve future sorts as new items are added.
All this boils down to is binary search, provided you can get your data sorted. There are two options.
Sorted Container
If you keep your numbers in a sorted container, this is pretty easy. Instead of using a HashMap, put the data in a TreeMap, then you can efficiently find the next lower or next higher element. Java even has methods to do exactly what you want:
higherKey(K)
lowerKey(K)
This is efficient because TreeMap uses a red-black tree (a kind of balanced binary search tree) internally. higherKey and lowerKey simply start at the root and traverse the tree to find where your element should go.
I'm not sure what language you're using, but in C++ you would use std::map, and the analogous methods are:
iterator lower_bound(const key_type& k)
iterator upper_bound(const key_type& k)
Array + Sorting
If you don't want to keep your data sorted all the time, you can always dump your data into an array (or any random access container), use sort, and then use the STL's binary search routines on the array:
lower_bound
upper_bound
In Java the analog would be to dump things into an ArrayList, call Java's sort(), then use binarySearch().
All the search routines here are O(logn) time. The cost of keeping your data sorted is O(nlogn) with either a sorted container or with the array. With a sorted container, the cost is amortized over n insertions; with the array you pay it all at once when you call sort().
If you don't want to sort things at all, you can always use a linear search, but you will pay if you use this a lot, as it's an O(n) algorithm.
Put your data items into a tree, like an AVL tree, a red-black tree, or a B+/B- tree. Then you can search the ordered values.
Sort the numbers, then perform binary search on each key to bisect the set. You can then find which numbers are on either side of your missing key.
Convert the set to a list and sort it, then run a binary search for the number not in the set. The result will be the insertion point, i.e. the position at which the number would be present if it were there. If you call that n (counting from 0), then the element at index n-1 of the sorted list is the next smallest number and the element at index n of the sorted list is the next largest number.
You can also do this by keeping the set in sorted order as you construct it, then it becomes an easy matter to search for the insertion point. This approach is used by e.g. the floorEntry() and ceilingEntry() methods of Java's TreeMap.
Keep your set as a sorted list/array and perform bisection-search: e.g., in Python, a sorted list and the bisect module from the standard Python library match your needs to the hilt.
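Applied to the date example from the question, a minimal Python sketch with bisect could look like this. It assumes the dates are kept in a sorted list with a parallel list of values; the names dates, values and interpolate are illustrative.

```python
import bisect
from datetime import date

# Known data points from the question, kept sorted by date.
dates = [date(2000, 10, 1), date(2000, 10, 5)]
values = [60.0, 68.0]

def interpolate(query):
    """Linearly interpolate between the nearest earlier and later known dates."""
    i = bisect.bisect_left(dates, query)      # insertion point, O(log n)
    if i < len(dates) and dates[i] == query:
        return values[i]                      # exact hit
    if i == 0 or i == len(dates):
        raise ValueError("query lies outside the known date range")
    prev_date, next_date = dates[i - 1], dates[i]
    fraction = (query - prev_date).days / (next_date - prev_date).days
    return values[i - 1] + fraction * (values[i] - values[i - 1])
```

Calling interpolate(date(2000, 10, 2)) gives 62.0, the value the question expects. The neighbours sit at index i - 1 (next smallest) and i (next largest) relative to the insertion point.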
If you get the keys in an array, you can sort the array and find the index of the last element that is less than the desired element. Then you know the index of the key directly before your desired point, and the next element after that is the one directly after.
That should give you enough to interpolate.
(The data structure used need not be an array, anything that will sort is fine. A balanced binary tree, as suggested by others, would be ideal, especially if you plan to reuse the data later).
Finding the n'th element in an unsorted set is O(n). (Select Algorithm) Although here you can boil it down to a simpler, less general algorithm, if you always want the smallest & next smallest elements. But in general, finding the smallest, second smallest, etc. element within an unsorted list is O(n). (You should have been taught this in your algorithms class...)
Sorting a set, and then indexing the element is O(n log n)
Finding an element in a sorted set is O(log n) (binary search)
If you know that there will always be a data point for, say, each week, then keep your HashMap as it is and do what you suggest... That will be a constant time operation since you will be doing 14 hash table lookups (probing 7 days on each side of your search date), each taking O(1) primitive operations.
If you don't know how dense your data is and you can keep it in RAM, then put it into a balanced tree structure as suggested by many others. But this can be costly if you have very many dates and if you have to load the data over the network from a database.

Resources