Choice between interpolation search and binary search - algorithm

I have a static sorted integer array of size N.
There will be a lot of element search queries
(get index of an element if it exists or of the nearest lesser element if it doesn't),
and I have to choose whether to use interpolation search or to stay with old good binary search.
how large is N when there is significant benefit from interpolation search?
what test to use for a check that the array values are uniformly distributed?
are there any other heuristics that can help me with the choice?

Related

Most efficient data structure to use for distributions

Use case:
An element has a value between 0 and Infinity
A user can make a query like -- "Give me the lowest value that is inside the top 30% of the elements"
We can store 1000 of the top elements, and at any point more elements can be added.
What data structure could you use to make this work?
My thoughts where that a vector could work, but might not be the most efficient for insertion and retrieval.
Tentative idea: Keep the array sorted. If the user asks for the lowest element that meets 30% of the distribution, do array.length * 0.7, and return that element.
I would definitely go with a Binary tree, this sorts data very well and allows you to get to what you want relatively quickly.
Arrays are good for insertion or specific indexes, but anytime you are searching for a key, you have a worst-case scenario of O(n), which is not good.
Binary trees have a search algorithm of O(log(n)), which is significantly faster.

Datastructure for fast and efficient search

I have to store the sorted data in a data structure.
The data structure I want to use is heap or binary search tree.
But I am confused which one would better serve the requirement i.e. fast and efficient searching.
----MORE DETAILS---
I am designing an application that receive data from a source(say a data grid) and then store it into a data structure. The data that comes from data GRID station is in the form of sorted digits. The sorted data can be in ascending or descending order.
now I have to search the data. and the process should be efficient and fast.
A heap will only let you search quickly for the minimum element (find it in O(1) time, remove it in O(log n) time). If you design it the other way, it will let you find the maximum, but you don't get both. To search for arbitrary elements quickly (in O(log n) time), you'll want the binary search tree.
For efficient searching, one would definitely prefer a binary search tree.
To search for a value in a heap may require that you search the entire tree - you can't guarantee that some value may not appear on either the left or right subtree (unless one of the children is already greater than the target value, but this isn't guaranteed to happen).
So searching in a heap takes O(n), where-as it takes O(log n) in a (self-balancing) binary search tree.
A heap is only really preferred if you're primarily interested in finding and/or removing the minimum / maximum, along with insertions.
Either can be constructed in O(n) if you're given already-sorted data.
You mentioned a sorted data structure, but in the "more details" in your question I don't really see that a sorted data structure is required (it doesn't matter too much that that's the form in which your data is already in), but it really depends on exactly what type of queries you will do.
If you're only going to search for exact values, you don't really need a sorted data structure, and can use a hash table instead, which supports expected O(1) lookups.
Let me make a list of potential data structures and we'll elaborate:
Binary search tree - it contains sorted data so adding new elements is costly (O(log n) I think). When you search through it you can use the binary search which is O(log n). IT is memory efficient and it doesn't need much additional memory.
Hash table (http://en.wikipedia.org/wiki/Hash_table) - every element is stored with a Hash. You can get element by providing the hash. Your elements don't need to be sortable, they only need to provide hashing method. Accessing elements is O(1) which I suppose is pretty decent one :)
I myself usually use hashtables but it depends on what exactly you need to store and how often you add or delete elements.
Check this also: Advantages of Binary Search Trees over Hash Tables
So in my opinion out of heap and binary search list, use Hash table.
I would go with hash table with separate chaining with an AVLTree (I assume collision occurs). It will work better than O(logn) where n is number of items. After getting the index with hash function, m items will be in this index where m is less than or equal to n. (It is usually much smaller, but never more).
O(1) for hashing and O(logm) for searching in AVLTree. This is faster than binary search for sorted data.

Is binary search optimal in worst case?

Is binary search optimal in worst case? My instructor has said so, but I could not find a book that backs it up. We start with an ordered array, and in worst case(worst case for that algorithm), any algorithm will always take more pairwise comparisons than binary search.
Many people said that the question was unclear. Sorry! So the input is any general sorted array. I am looking for a proof which says that any search algorithm will take at least log2(N) comparisons in worst case(worst case for the algo in consideration).
Yes, binary search is optimal.
This is easily seen by appealing to information theory. It takes log N bits merely to identify a unique element out of N elements. But each comparison only gives you one bit of information. Therefore, you must perform log N comparisons to identify a unique element.
More verbosely... Consider a hypothetical algorithm X that outperforms binary search in the worst case. For a particular element of the array, run the algorithm and record the questions it asks; i.e., the sequence of comparisons it performs. Or rather, record the answers to those questions (like "true, false, false, true").
Convert that sequence into a binary string (1,0,0,1). Call this binary string the "signature of the element with respect to algorithm X". Do this for each element of the array, assigning a "signature" to each element.
Now here is the key. If two elements have the same signature, then algorithm X cannot tell them apart! All the algorithm knows about the array are the answers it gets from the questions it asks; i.e., the comparisons it performs. And if the algorithm cannot tell two elements apart, then it cannot be correct. (Put another way, if two elements have the same signature, meaning they result in the same sequence of comparisons by the algorithm, which one did the algorithm return? Contradiction.)
Finally, prove that if every signature has fewer than log N bits, then there must exist two elements with the same signature (pigeonhole principle). Done.
[update]
One quick additional comment. The above assumes that the algorithm does not know anything about the array except what it learns from performing comparisons. Of course, in real life, sometimes you do know something about the array a priori. As a toy example, if I know that the array has (say) 10 elements all between 1 and 100, and that they are distinct, and that the numbers 92 through 100 are all present in the array... Then clearly I do not need to perform four comparisons even in the worst case.
More realistically, if I know that the elements are uniformly distributed (or roughly uniformly distributed) between their min and their max, again I can do better than binary search.
But in the general case, binary search is still optimal.
Worst case for which algorithm? There's not one universal "worst case." If your question is...
"Is there a case where binary search takes more comparisons than another algorithm?"
Then, yes, of course. A simple linear search takes less time if the element happens to be the first one in the list.
"Is there even an algorithm with a better worst-case running time than binary search?"
Yes, in cases where you know more about the data. For example, a radix tree or trie is at worst constant-time with regard to the number of entries (but linear in length of the key).
"Is there a general search algorithm with a better worst-case running time than binary search?"
If you can only assume you have a comparison function on keys, no, the best worst-case is O(log n). But there are algorithms that are faster, just not in a big-O sense.
... so I suppose you really would have to define the question first!
Binary search has a worst case complexity of O(log(N)) comparisons - which is optimal for a comparison based search of a sorted array.
In some cases it might make sense to do something other than a purely comparison based search - in this case you might be able to beat the O(log(N)) barrier - i.e. check out interpolation search.
It depends on the nature of the data. For example the English language and a dictionary. You could write an algorithm to achieve better than a binary search by making use of the fact that certain letters occur within the English language with different frequencies.
But in general a binary search is a safe bet.
I think the question is a little unclear, but still here are my thoughts.
Worst case of Binary search would be when the element you are searching for is found after all log n comparisons. But the same data can be best case for linear search. It depends on the data arrangement and what you are searching for but the worst case for Binary search would end up to be log n. Now, this cannot be compared with same data and search for linear search since its worst case would be different. The worst case for Linear search could be finding an element which happens to be at the end of the array.
For example: array A = 1, 2, 3, 4, 5, 6 and Binary Search on A for 1 would be the worst case. Whereas for the same array, linear search for 6 would be the worst case, not search for 1.

Data structure to build and lookup set of integer ranges

I have a set of uint32 integers, there may be millions of items in the set. 50-70% of them are consecutive, but in input stream they appear in unpredictable order.
I need to:
Compress this set into ranges to achieve space efficient representation. Already implemented this using trivial algorithm, since ranges computed only once speed is not important here. After this transformation number of resulting ranges is typically within 5 000-10 000, many of them are single-item, of course.
Test membership of some integer, information about specific range in the set is not required. This one must be very fast -- O(1). Was thinking about minimal perfect hash functions, but they do not play well with ranges. Bitsets are very space inefficient. Other structures, like binary trees, has complexity of O(log n), worst thing with them that implementation make many conditional jumps and processor can not predict them well giving poor performance.
Is there any data structure or algorithm specialized in integer ranges to solve this task?
Regarding the second issue:
You could look-up on Bloom Filters. Bloom Filters are specifically designed to answer the membership question in O(1), though the response is either no or maybe (which is not as clear cut as a yes/no :p).
In the maybe case, of course, you need further processing to actually answer the question (unless a probabilistic answer is sufficient in your case), but even so the Bloom Filter may act as a gate keeper, and reject most of the queries outright.
Also, you might want to keep actual ranges and degenerate ranges (single elements) in different structures.
single elements may be best stored in a hash-table
actual ranges can be stored in a sorted array
This diminishes the number of elements stored in the sorted array, and thus the complexity of the binary search performed there. Since you state that many ranges are degenerate, I take it that you only have some 500-1000 ranges (ie, an order of magnitude less), and log(1000) ~ 10
I would therefore suggest the following steps:
Bloom Filter: if no, stop
Sorted Array of real ranges: if yes, stop
Hash Table of single elements
The Sorted Array test is performed first, because from the number you give (millions of number coalesced in a a few thousands of ranges) if a number is contained, chances are it'll be in a range rather than being single :)
One last note: beware of O(1), while it may seem appealing, you are not here in an asymptotic case. Barely 5000-10000 ranges is few, as log(10000) is something like 13. So don't pessimize your implementation by getting a O(1) solution with such a high constant factor that it actually runs slower than a O(log N) solution :)
If you know in advance what the ranges are, then you can check whether a given integer is present in one of the ranges in O(lg n) using the strategy outlined below. It's not O(1), but it's still quite fast in practice.
The idea behind this approach is that if you've merged all of the ranges together, you have a collection of disjoint ranges on the number line. From there, you can define an ordering on those intervals by saying that the interval [a, b] ≤ [c, d] iff b ≤ c. This is a total ordering because all of the ranges are disjoint. You can thus put all of the intervals together into a static array and then sort them by this ordering. This means that the leftmost interval is in the first slot of the array, and the rightmost interval is in the rightmost slot. This construction takes O(n lg n) time.
To check if a some interval contains a given integer, you can do a binary search on this array. Starting at the middle interval, check if the integer is contained in that interval. If so, you're done. Otherwise, if the value is less than the smallest value in the range, continue the search on the left, and if the value is greater than the largest value in the range, continue the search on the right. This is essentially a standard binary search, and it should run in O(lg n) time.
Hope this helps!
AFAIK there is no such algorithm that search over integer list in O(1).
One only can do O(1) search with vast amount of memory.
So it is not very promising to try to find O(1) search algorithm over list of range of integer.
On the other hand, you could try time/memory trade-off approach by carefully examining your data set (eventually building a kind of hash table).
You can use y-fast trees or van Emde Boas trees to achieve O(lg w) time queries, where w is the number of bits in a word, and you can use fusion trees to achieve O(lg_w n) time queries. The optimal tradeoff in terms of n is O(sqrt(lg(n))).
The easiest of these to implement is probably y-fast trees. They are probably faster than doing binary search, though they require roughly O(lg w) = O(lg 32) = O(5) hash table queries, while binary search requires roughly O(lg n) = O(lg 10000) = O(13) comparisons, so binary search may be faster.
Rather than a 'comparison' based storage/retrieval ( which will always be O(log(n)) ),
You need to work on 'radix' based storage/retrieval .
In other words .. extract nibbles from the uint32, and make a trie ..
Keep your ranges into a sorted array and use binary search for lookups.
It's easy to implement, O(log N), and uses less memory and needs less memory accesses than any other tree based approach, so it will probably be also much faster.
From the description of you problem it sounds like the following might be a good compromise. I've described it using an Object oriented language, but is easily convertible to C using a union type or structure with a type member and a pointer.
Use the first 16 bits to index an array of objects (of size 65536). In that array there are 5 possible objects
a NONE object means no elements beginning with those 16bits are in the set
an ALL object means all elements beginning with 16 bits are in the set
a RANGE object means all items with the final 16bits between an upper and lower bound are in the set
a SINGLE object means just one element beginning with the 16bits is in the array
a BITSET object handles all remaining cases with a 65536 bit bitset
Of course, you don't need to split at 16bits, you can adjust to reflect the statistics of your set. In fact you don't need to use consecutive bits, but it speeds up the bit twiddling, and if many of your elements are consecutive as you claim will give good properties.
Hopefully this makes sense, please comment if I need to explain more fully. Effectively you've combined a depth 2 binary tree with a ranges and a bitset for a time/speed tradeoff. If you need to save memory then make the tree deeper with a corresponding slight increase in lookup time.

Fast algorithm for finding next smallest and largest number in a set

I have a set of positive numbers. Given a number not in the set, I want to find the next smallest and next largest numbers that are in the set. The only way I can think to do it now is to find the next smallest by decreasing by 1 until I find a number in the set, and then do the same for finding the next largest.
Motivation: I have a bunch of data in a hashmap, keyed by dates. I don't have a datapoint for every single date. If I have data for, say, 10/01/2000 as 60 and 10/05/2000 as 68, and I ask for 10/02/2000, I want to linearly interpolate. I should get 62.
It depends on if your set is sorted.
If your set is unsorted then finding the closest (higher and lower) is an O(n) operation and a fairly simple algorithm.
If your set is sorted then you can use a modified bisection search to find the answer in O(log n), which is obviously a lot better particularly on larger sets.
If you're doing this repeatedly it might be worth sorting the set, which incurs an O(n log n) cost that might be once off or not depending on how often the set changes. Some kind of tree sort may help improve future sorts as new items are added.
All this boils down to is binary search, provided you can get your data sorted. There are two options.
Sorted Container
If you keep your numbers in a sorted container, this is pretty easy. Instead of using a HashMap, put the data in a TreeMap, then you can efficiently find the next lower or next higher element. Java even has methods to do exactly what you want:
higherKey(K)
lowerKey(K)
This is efficient because TreeMap uses a red-black tree (a kind of balanced binary search tree) internally. higherKey and lowerKey simply start at the root and traverse the tree to find where your element should go.
I'm not sure what language you're using, but in C++ you would usestd::map, and the analogous methods are:
iterator lower_bound(const key_type& k)
iterator upper_bound(const key_type& k)
Array + Sorting
If you don't want to keep your data sorted all the time, you can always dump your data into an array (or any random access container), use sort, and then use the STL's binary search routines on the array:
lower_bound
upper_bound
In Java the analog would be to dump things into an ArrayList, call Java's sort(), then use binarySearch().
All the search routines here are O(logn) time. The cost of keeping your data sorted is O(nlogn) with either a sorted container or with the array. With a sorted container, the cost is amortized over n insertions; with the array you pay it all at once when you call sort().
If you don't want to sort things at all, you can always use a linear search, but you will pay if you use this a lot, as it's an O(n) algorithm.
Put your data items into a tree, like an AVL tree, a red-black tree, or a B+/B- tree. Then you can search the ordered values.
Sort the numbers, then perform binary search on each key to bisect the set. You can then find which numbers are on either side of your missing key.
Convert the set to a list and sort it, then run a binary search for the number not in the set. The result will be the insertion point, i.e. the position at which the number would be present if it were there. If you call that n, then the element at index n of the sorted list is the next smallest number and the element at index n+1 of the sorted list is the next largest number.
You can also do this by keeping the set in sorted order as you construct it, then it becomes an easy matter to search for the insertion point. This approach is used by e.g. the floorEntry() and ceilingEntry() methods of Java's TreeMap.
Keep your set as a sorted list/array and perform bisection-search: e.g., in Python, a sorted list and the bisect module from the standard Python library match your needs to the hilt.
If you get the keys in an array, you can sort the array and find the index of the last element that is less than the desired element. Then you know the index of the key directly before your desired point, and the next element after that is the one directly after.
That should give you enough to interpolate.
(The data structure used need not be an array, anything that will sort is fine. A balanced binary tree, as suggested by others, would be ideal, especially if you plan to reuse the data later).
Finding the n'th element in an unsorted set is O(n). (Select Algorithm) Although here you can boil it down to a simpler, less general algorithm, if you always want the smallest & next smallest elements. But in general, finding the smallest, second smallest, etc. element within an unsorted list is O(n). (You should have been taught this in your algorithms class...)
Sorting a set, and then indexing the element is O(n log n)
Finding an element in a sorted set is O(log n) (binary search)
If you know that there will always be a data point for, say, each week, then keep your HashMap as it is and do what you suggest... That will be a constant time operation since you will be doing 14 hash table lookups (probing 7 days on each side of your search date), each taking O(1) primitive operations.
If you don't know how dense your data is and you can keep it in RAM, then put it into a balanced tree structure as suggested by many others. But this can be costly if you have very many dates and if you have to load the data over the network from a database.

Resources