how to calculate different groups of one million binary sequences? - algorithm

I have one million binary sequences, all of the same length, such as (1000010011, 1100110000, ...) and so on. I want to know how many different groups they form (identical sequences belong to the same group). What is the fastest way?
No stoi please.

Depending on the length L of a sequence:
L < ~20: bucket sort
This is short enough in comparison to the input size. A bucket sort with 2^L buckets is all you need. Preallocate an array of size 2^L; since you have about a million sequences and 2^20 is about a million, you will only need O(n) of additional memory.
Go through your sequences and sort them into the buckets.
Go through the buckets, count the results, and return them.
And we're done.
The time complexity will be O(n) with O(n) memory cost. This is optimal complexity-wise since you have to visit every element at least once to check its value anyway.
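A minimal sketch of the bucket idea, assuming the sequences arrive as Python strings of '0' and '1' characters. The function name `count_distinct_bucket` is made up for illustration, and the bucket index is built bit by bit rather than with a library string-to-integer conversion:

```python
def count_distinct_bucket(sequences, length):
    """Count distinct fixed-length binary strings with a 2**length presence table."""
    seen = bytearray(1 << length)      # one flag per possible sequence value
    distinct = 0
    for seq in sequences:
        index = 0
        for ch in seq:                 # build the bucket index one bit at a time
            index = (index << 1) | (ch == "1")
        if not seen[index]:            # first time this value shows up
            seen[index] = 1
            distinct += 1
    return distinct


print(count_distinct_bucket(["1000010011", "1100110000", "1000010011"], 10))
```

Storing a counter per bucket instead of a flag would also give the size of each group, at the cost of more memory per bucket.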
L reasonably large: hash table
If you pick a reasonable hashing function and a good size for the hash table (or a dictionary, if we need to store the counts) [1], you will have a small number of collisions while inserting. The total time will be expected O(n), since with a good hash each insert is amortized O(1).
As a side note, the bucket sort is technically a perfect hash, since the hash function in this case is a one-to-one function.
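For longer sequences, the hash-table variant could look roughly like this. It is only a sketch that leans on Python's built-in dictionary (via `collections.Counter`) instead of a hand-sized table, and `group_counts` is an illustrative name:

```python
from collections import Counter


def group_counts(sequences):
    """Map each distinct sequence to the number of times it occurs."""
    return Counter(sequences)          # dict-backed, expected O(1) per insert


groups = group_counts(["1000010011", "1100110000", "1000010011"])
print(len(groups))                     # number of distinct groups
print(groups["1000010011"])            # size of one group
```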
L unreasonably large: binary tree
If for some reason constructing a hash is not feasible, or you wish for consistency, then building a binary tree to hold the values is the way to go.
This will take O(n log n), as balanced binary trees usually do.
[1] ~2M entries should be enough and it is still O(n). Maybe you could go even lower, to around 1.5M.

Related

Find the 10,000 largest out of 1,000,000 total values

I have a file that has 1,000,000 float values in it. I need to find the 10,000 largest values.
I was thinking of:
Reading the file
Converting the strings to floats
Placing the floats into a max-heap (a heap where the largest value is the root)
After all values are in the heap, removing the root 10,000 times and adding those values to a list/arraylist.
I know I will have
1,000,000 inserts into the heap
10,000 removals from the heap
10,000 inserts into the return list
Would this be a good solution? This is for a homework assignment.
Your solution is mostly good. It's basically a heapsort that stops after getting K elements, which improves the running time from O(NlogN) (for a full sort) to O(N + KlogN). Here N = 1000000 and K = 10000.
However, you should not do N inserts to the heap initially, as this would take O(NlogN) - instead, use a heapify operation which turns an array to a heap in linear time.
If the K numbers don't need to be sorted, you can find the Kth largest number in linear time using a selection algorithm, and then output all numbers larger than it. This gives an O(n) solution.
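As a rough illustration of the heapify suggestion, here is a sketch using Python's `heapq`, which only provides a min-heap, so the values are negated to simulate a max-heap. The name `top_k_sorted` is an assumption for this example:

```python
import heapq


def top_k_sorted(values, k):
    """Return the k largest values in descending order via heapify plus k pops."""
    heap = [-v for v in values]        # negate: heapq is a min-heap
    heapq.heapify(heap)                # O(N), instead of N separate inserts
    count = min(k, len(heap))
    return [-heapq.heappop(heap) for _ in range(count)]   # O(K log N)


print(top_k_sorted([3.5, 1.0, 9.25, 4.0, 7.5], 2))
```

The standard library also offers `heapq.nlargest(k, values)`, which returns the same result without writing the heap handling by hand.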
How about using mergesort (O(n log n) in the worst case) to sort the 1,000,000 values into an array and then taking the last 10,000 directly?
Sorting is expensive, and your input set is not small. Fortunately, you don't care about order. All you need is to know that you have the top X numbers. So, don't sort.
How would you do this problem if, instead of looking for the top 10,000 out of 1,000,000, you were looking for the top 1 (i.e. the single largest value) out of 100? You'd only need to keep track of the largest value you'd seen so far, and compare it to the next number and the next one until you found a larger one or you ran out of input. Could you expand that idea back out to the input size you're looking at? What would be the big-O (hint: you'd only be looking at each input number one time)?
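One way to generalize the "remember the largest so far" idea is to remember the K largest so far in a small min-heap. This is only one possible reading of the hint above; the function name is illustrative:

```python
import heapq


def top_k_streaming(values, k):
    """Keep the k largest values seen so far; the heap root is the smallest of them."""
    if k <= 0:
        return []
    heap = []
    for v in values:
        if len(heap) < k:
            heapq.heappush(heap, v)        # still filling up
        elif v > heap[0]:
            heapq.heapreplace(heap, v)     # evict the current smallest
    return heap                            # unordered; sort if order matters


print(sorted(top_k_streaming([5, 1, 9, 3, 7, 8], 3)))
```

Each input value is examined once and costs at most O(log K), so the whole pass is O(N log K) with only O(K) extra memory.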
Final note since you said this was homework: if you've just been learning about heaps in class, and you think your teacher/professor is looking for a heap solution, then yes, your idea is good.
Could you merge sort the values in the array after you have read them all in? This is a fast way to sort the values. If you sort in descending order, you could then read your_array[9999] (zero-based) and you would know that it is the 10,000th largest. Merge sort sounds like what you want. Also, if you really need speed, you could look into formatting your values for radix sort; that would take a bit of preparation, but it could well be the fastest way to solve this problem.

Which search data structure works best for sorted integer data?

I have over a billion sorted integers. Which data structure do you think can exploit the sorted order? The main goal is to search items faster...
Options I can think of --
1) regular Binary Search trees with recursively splitting in the middle approach.
2) Any other balanced Binary search trees should work well, but does not exploit the sorted heuristics..
Thanks in advance..
[Edit]
Insertions and deletions are very rare...
Also, apart from the integers I have to store some other information in the nodes. I think plain arrays can't do that unless it is a list, right?
This really depends on what operations you want to do on the data.
If you are just searching the data and never inserting or deleting anything, just storing the data in a giant sorted array may be perfectly fine. You could then use binary search to look up elements efficiently in O(log n) time. However, insertions and deletions can be expensive since with a billion integers O(n) will hurt. You could store auxiliary information inside the array itself, if you'd like, by just placing it next to each of the integers.
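A small sketch of the sorted-array approach using Python's `bisect` module. The auxiliary data is kept in a parallel list here, which is one assumption about how "placing it next to each of the integers" might look; `lookup` is an illustrative name:

```python
from bisect import bisect_left


def lookup(keys, payloads, target):
    """Binary-search sorted `keys`; return the payload stored alongside `target`."""
    i = bisect_left(keys, target)              # O(log n)
    if i < len(keys) and keys[i] == target:
        return payloads[i]
    return None                                # target is not present


keys = [2, 3, 5, 8, 13]
payloads = ["two", "three", "five", "eight", "thirteen"]
print(lookup(keys, payloads, 8))
print(lookup(keys, payloads, 7))
```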
However, with a billion integers, this may be too memory-intensive and you may want to switch to using a bit vector. You could then do a binary search over the bitvector in time O(log U), where U is the number of bits. With a billion integers, I assume that U and n would be close, so this isn't that much of a penalty. Depending on the machine word size, this could save you anywhere from 32x to 128x memory without causing too much of a performance hit. Plus, this will increase the locality of the binary searches and can improve performance as well. This does make it much slower to actually iterate over the numbers in the list, but it makes insertions and deletions take O(1) time. In order to do this, you'd need to store some secondary structure (perhaps a hash table?) containing the data associated with each of the integers. This isn't too bad, since you can use this sorted bit vector for sorted queries and the unsorted hash table once you've found what you're looking for.
If you also need to add and remove values from the list, a balanced BST can be a good option. However, because you specifically know that you're storing integers, you may want to look at the more complex van Emde Boas tree structure, which supports insertion, deletion, predecessor, successor, find-max, and find-min all in O(log log n) time, which is exponentially faster than binary search trees. The implementation cost of this approach is high, though, since the data structure is notoriously tricky to get right.
Another data structure you might want to explore is a bitwise trie, which has the same time bounds as the sorted bit vector but allows you to store auxiliary data along with each integer. Plus, it's super easy to implement!
Hope this helps!
The best data structure for searching sorted integers is an array.
You can search it with log(N) operations, and it is more compact (less memory overhead) than a tree.
And you don't even have to write any code (so less chance of a bug) -- just use bsearch from your standard library.
With a sorted array, the best you can achieve is interpolation search, which gives you O(log log n) average time when the keys are roughly uniformly distributed. It is essentially a binary search, but it doesn't divide the array into two subarrays of the same size.
It's really fast and extraordinarily easy to implement.
http://en.wikipedia.org/wiki/Interpolation_search
Don't let the worst-case O(n) bound scare you, because with 1 billion integers it's practically impossible to hit.
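A minimal interpolation search sketch over a sorted list of integers. It is an illustration of the general technique, not code from the linked article:

```python
def interpolation_search(arr, target):
    """Return an index of `target` in sorted `arr`, or -1 if it is absent."""
    lo, hi = 0, len(arr) - 1
    while lo <= hi and arr[lo] <= target <= arr[hi]:
        if arr[lo] == arr[hi]:                 # avoid dividing by zero
            return lo if arr[lo] == target else -1
        # Guess the position by assuming values are spread evenly in [lo, hi].
        pos = lo + (target - arr[lo]) * (hi - lo) // (arr[hi] - arr[lo])
        if arr[pos] == target:
            return pos
        if arr[pos] < target:
            lo = pos + 1
        else:
            hi = pos - 1
    return -1


print(interpolation_search([10, 20, 30, 40, 50], 40))
```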
O(1) solutions:
Assuming 32-bit integers and a lot of ram:
A lookup table of size 2³² (roughly 4 billion elements), where each index holds the number of integers with that value.
Assuming larger integers:
A really big hash table. The usual modulus hash function would be appropriate if you have a decent distribution of the values, if not, you might want to combine the 32-bit strategy with a hash lookup.

Data structure to build and lookup set of integer ranges

I have a set of uint32 integers, there may be millions of items in the set. 50-70% of them are consecutive, but in input stream they appear in unpredictable order.
I need to:
Compress this set into ranges to achieve space efficient representation. Already implemented this using trivial algorithm, since ranges computed only once speed is not important here. After this transformation number of resulting ranges is typically within 5 000-10 000, many of them are single-item, of course.
Test membership of some integer, information about specific range in the set is not required. This one must be very fast -- O(1). Was thinking about minimal perfect hash functions, but they do not play well with ranges. Bitsets are very space inefficient. Other structures, like binary trees, has complexity of O(log n), worst thing with them that implementation make many conditional jumps and processor can not predict them well giving poor performance.
Is there any data structure or algorithm specialized in integer ranges to solve this task?
Regarding the second issue:
You could look into Bloom filters. Bloom filters are specifically designed to answer the membership question in O(1), though the response is either no or maybe (which is not as clear cut as a yes/no :p).
In the maybe case, of course, you need further processing to actually answer the question (unless a probabilistic answer is sufficient in your case), but even so the Bloom Filter may act as a gate keeper, and reject most of the queries outright.
Also, you might want to keep actual ranges and degenerate ranges (single elements) in different structures.
single elements may be best stored in a hash-table
actual ranges can be stored in a sorted array
This diminishes the number of elements stored in the sorted array, and thus the complexity of the binary search performed there. Since you state that many ranges are degenerate, I take it that you only have some 500-1000 ranges (ie, an order of magnitude less), and log(1000) ~ 10
I would therefore suggest the following steps:
Bloom Filter: if no, stop
Sorted Array of real ranges: if yes, stop
Hash Table of single elements
The Sorted Array test is performed first because, from the numbers you give (millions of numbers coalesced into a few thousand ranges), if a number is contained, chances are it'll be in a range rather than being single :)
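A sketch of the last two stages of that pipeline, with the Bloom filter stage deliberately left out to keep it short. The class name `RangeSet` and the inclusive `(start, end)` tuple layout are assumptions for illustration, and the ranges are assumed to be disjoint:

```python
from bisect import bisect_right


class RangeSet:
    """Membership test over disjoint inclusive ranges, with single items kept apart."""

    def __init__(self, ranges):
        ranges = list(ranges)
        multi = sorted(r for r in ranges if r[0] != r[1])
        self.starts = [s for s, _ in multi]    # sorted starts of the real ranges
        self.ends = [e for _, e in multi]
        self.singles = {s for s, e in ranges if s == e}   # degenerate ranges

    def __contains__(self, x):
        # Real ranges first: find the last range starting at or before x.
        i = bisect_right(self.starts, x) - 1
        if i >= 0 and x <= self.ends[i]:
            return True
        return x in self.singles               # then the hash-based fallback


rs = RangeSet([(1, 5), (10, 10), (20, 30)])
print(3 in rs, 10 in rs, 6 in rs)
```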
One last note: beware of O(1), while it may seem appealing, you are not here in an asymptotic case. Barely 5000-10000 ranges is few, as log(10000) is something like 13. So don't pessimize your implementation by getting a O(1) solution with such a high constant factor that it actually runs slower than a O(log N) solution :)
If you know in advance what the ranges are, then you can check whether a given integer is present in one of the ranges in O(lg n) using the strategy outlined below. It's not O(1), but it's still quite fast in practice.
The idea behind this approach is that if you've merged all of the ranges together, you have a collection of disjoint ranges on the number line. From there, you can define an ordering on those intervals by saying that the interval [a, b] ≤ [c, d] iff b ≤ c. This is a total ordering because all of the ranges are disjoint. You can thus put all of the intervals together into a static array and then sort them by this ordering. This means that the leftmost interval is in the first slot of the array, and the rightmost interval is in the rightmost slot. This construction takes O(n lg n) time.
To check if some interval contains a given integer, you can do a binary search on this array. Starting at the middle interval, check if the integer is contained in that interval. If so, you're done. Otherwise, if the value is less than the smallest value in the range, continue the search on the left, and if the value is greater than the largest value in the range, continue the search on the right. This is essentially a standard binary search, and it should run in O(lg n) time.
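The search just described can be sketched as follows, assuming the intervals are stored as sorted, disjoint, inclusive `(start, end)` tuples (the representation and the function name are illustrative choices):

```python
def in_any_interval(intervals, x):
    """Binary search over sorted, disjoint, inclusive (start, end) intervals."""
    lo, hi = 0, len(intervals) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        start, end = intervals[mid]
        if x < start:
            hi = mid - 1          # x can only be in an interval further left
        elif x > end:
            lo = mid + 1          # x can only be in an interval further right
        else:
            return True           # start <= x <= end
    return False


intervals = [(1, 5), (10, 10), (20, 30)]
print(in_any_interval(intervals, 25))
print(in_any_interval(intervals, 7))
```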
Hope this helps!
AFAIK there is no algorithm that searches an integer list in O(1).
One can only do O(1) search with a vast amount of memory.
So it is not very promising to look for an O(1) search algorithm over a list of integer ranges.
On the other hand, you could try time/memory trade-off approach by carefully examining your data set (eventually building a kind of hash table).
You can use y-fast trees or van Emde Boas trees to achieve O(lg w) time queries, where w is the number of bits in a word, and you can use fusion trees to achieve O(lg_w n) time queries. The optimal tradeoff in terms of n is O(sqrt(lg(n))).
The easiest of these to implement is probably y-fast trees. They are probably faster than doing binary search, though they require roughly O(lg w) = O(lg 32) = O(5) hash table queries, while binary search requires roughly O(lg n) = O(lg 10000) = O(13) comparisons, so binary search may be faster.
Rather than a 'comparison' based storage/retrieval ( which will always be O(log(n)) ),
You need to work on 'radix' based storage/retrieval .
In other words .. extract nibbles from the uint32, and make a trie ..
Keep your ranges in a sorted array and use binary search for lookups.
It's easy to implement, O(log N), and uses less memory and needs fewer memory accesses than any tree-based approach, so it will probably also be much faster.
From the description of your problem, it sounds like the following might be a good compromise. I've described it using an object-oriented language, but it is easily convertible to C using a union type or a structure with a type member and a pointer.
Use the first 16 bits to index an array of objects (of size 65536). In that array there are 5 possible objects
a NONE object means no elements beginning with those 16bits are in the set
an ALL object means all elements beginning with 16 bits are in the set
a RANGE object means all items with the final 16bits between an upper and lower bound are in the set
a SINGLE object means just one element beginning with the 16bits is in the array
a BITSET object handles all remaining cases with a 65536 bit bitset
Of course, you don't need to split at 16bits, you can adjust to reflect the statistics of your set. In fact you don't need to use consecutive bits, but it speeds up the bit twiddling, and if many of your elements are consecutive as you claim will give good properties.
Hopefully this makes sense, please comment if I need to explain more fully. Effectively you've combined a depth-2 tree with ranges and a bitset for a time/space tradeoff. If you need to save memory, make the tree deeper, with a corresponding slight increase in lookup time.
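A rough Python sketch of this two-level layout, using tagged tuples in place of objects. The tag constants, the function names, and the bit packing of the bitset are all illustrative choices, not part of the description above:

```python
NONE, ALL, RANGE, SINGLE, BITSET = range(5)


def build_table(values):
    """Build a 65536-entry table indexed by the high 16 bits of each uint32."""
    groups = {}
    for v in values:
        groups.setdefault(v >> 16, set()).add(v & 0xFFFF)
    table = [(NONE,)] * 65536
    for high, lows in groups.items():
        lo, hi = min(lows), max(lows)
        if len(lows) == 65536:
            table[high] = (ALL,)
        elif len(lows) == 1:
            table[high] = (SINGLE, lo)
        elif hi - lo + 1 == len(lows):          # one contiguous run
            table[high] = (RANGE, lo, hi)
        else:
            bits = bytearray(8192)              # 65536 bits
            for low in lows:
                bits[low >> 3] |= 1 << (low & 7)
            table[high] = (BITSET, bits)
    return table


def table_contains(table, x):
    """Membership test: one array index plus one small case analysis."""
    entry = table[x >> 16]
    kind, low = entry[0], x & 0xFFFF
    if kind == NONE:
        return False
    if kind == ALL:
        return True
    if kind == SINGLE:
        return low == entry[1]
    if kind == RANGE:
        return entry[1] <= low <= entry[2]
    return bool(entry[1][low >> 3] & (1 << (low & 7)))


table = build_table({5, 70000, 70001, 70002, 200000, 200005})
print(table_contains(table, 70001), table_contains(table, 200001))
```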

Is partitioning easier than sorting?

This is a question that's been lingering in my mind for some time ...
Suppose I have a list of items and an equivalence relation on them, and comparing two items takes constant time.
I want to return a partition of the items, e.g. a list of linked lists, each containing all equivalent items.
One way of doing this is to extend the equivalence to an ordering on the items and order them (with a sorting algorithm); then all equivalent items will be adjacent.
But can it be done more efficiently than with sorting? Is the time complexity of this problem lower than that of sorting? If not, why not?
You seem to be asking two different questions at one go here.
1) If allowing only equality checks, does it make partition easier than if we had some ordering? The answer is, no. You require Omega(n^2) comparisons to determine the partitioning in the worst case (all different for instance).
2) If allowing ordering, is partitioning easier than sorting? The answer again is no. This is because of the Element Distinctness Problem. Which says that in order to even determine if all objects are distinct, you require Omega(nlogn) comparisons. Since sorting can be done in O(nlogn) time (and also have Omega(nlogn) lower bounds) and solves the partition problem, asymptotically they are equally hard.
If you pick an arbitrary hash function, equal objects need not have the same hash, in which case you haven't done any useful work by putting them in a hashtable.
Even if you do come up with such a hash (equal objects guaranteed to have the same hash), the time complexity is expected O(n) for good hashes, and worst case is Omega(n^2).
Whether to use hashing or sorting completely depends on other constraints not available in the question.
The other answers also seem to be forgetting that your question is (mainly) about comparing partitioning and sorting!
If you can define a hash function for the items as well as an equivalence relation, then you should be able to do the partition in linear time -- assuming computing the hash is constant time. The hash function must map equivalent items to the same hash value.
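A short sketch of the hash-based partition, assuming a caller-supplied `key` function that maps equivalent items to the same hashable value (here, case-insensitive string equality is used as a made-up example of an equivalence relation):

```python
from collections import defaultdict


def partition_by_key(items, key):
    """Group items whose keys are equal; expected O(n) with a constant-time key."""
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    return list(groups.values())


print(partition_by_key(["a", "B", "A", "b", "c"], str.lower))
```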
Without a hash function, you would have to compare every new item to be inserted into the partitioned lists against the head of each existing list. The efficiency of that strategy depends on how many partitions there will eventually be.
Let's say you have 100 items, and they will eventually be partitioned into 3 lists. Then each item would have to be compared against at most 3 other items before inserting it into one of the lists.
However, if those 100 items would eventually be partitioned into 90 lists (i.e., very few equivalent items), it's a different story. Now your runtime is closer to quadratic than linear.
If you don't care about the final ordering of the equivalence sets, then partitioning into equivalence sets could be quicker. However, it depends on the algorithm and the numbers of elements in each set.
If there are very few items in each set, then you might as well just sort the elements and then find the adjacent equal elements. A good sorting algorithm is O(n log n) for n elements.
If there are a few sets with lots of elements in each, then you can take each element and compare it to the existing sets. If it belongs in one of them, add it; otherwise create a new set. This will be O(n*m), where n is the number of elements and m is the number of equivalence sets, which is less than O(n log n) for large n and small m, but worse as m tends to n.
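The compare-against-existing-sets strategy can be sketched like this, assuming only an `equivalent(a, b)` predicate is available (the names are illustrative, and equality modulo 3 stands in for an arbitrary equivalence relation):

```python
def partition_by_equivalence(items, equivalent):
    """Compare each item against one representative per class: O(n * m) tests."""
    classes = []
    for item in items:
        for cls in classes:
            if equivalent(cls[0], item):   # cls[0] represents the whole class
                cls.append(item)
                break
        else:                              # no existing class matched
            classes.append([item])
    return classes


print(partition_by_equivalence([1, 4, 2, 7, 5], lambda a, b: a % 3 == b % 3))
```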
A combined sorting/partitioning algorithm may be quicker.
If a comparator must be used, then the lower bound is Ω(n log n) comparisons for sorting or partitioning. The reason is all elements must be inspected Ω(n), and a comparator must perform log n comparisons for each element to uniquely identify or place that element in relation to the others (each comparison divides the space in 2, and so for a space of size n, log n comparisons are needed.)
If each element can be associated with a unique key which is derived in constant time, then the lower bound is Ω(n), for sorting and partitioning (cf. RadixSort).
Comparison-based sorting generally has a lower bound of Ω(n log n).
Assume you iterate over your set of items and put them in buckets with items with the same comparative value, for example in a set of lists (say using a hash set). This operation is clearly O(n), even after retrieving the list of lists from the set.
--- EDIT: ---
This of course requires two assumptions:
There exists a constant time hash-algorithm for each element to be partitioned.
The number of buckets does not depend on the amount of input.
Thus, the lower bound of partitioning is O(n).
Partitioning is faster than sorting, in general, because you don't have to compare each element to each potentially-equivalent already-sorted element, you only have to compare it to the already-established keys of your partitioning. Take a close look at radix sort. The first step of radix sort is to partition the input based on some part of the key. Radix sort is O(kN). If your data set has keys bounded by a given length k, you can radix sort it in O(n). If your data are comparable and don't have a bounded key, but you choose a bounded key with which to partition the set, the complexity of sorting the set would be O(n log n) and the partitioning would be O(n).
This is a classic problem in data structures, and yes, it is easier than sorting. If you want to also quickly be able to look up which set each element belongs to, what you want is the disjoint set data structure, together with the union-find operation. See here: http://en.wikipedia.org/wiki/Disjoint-set_data_structure
The time required to perform a possibly-imperfect partition using a hash function will be O(n+bucketcount) [not O(n*bucketcount)]. Making the bucket count large enough to avoid all collisions will be expensive, but if the hash function works at all well there should be a small number of distinct values in each bucket. If one can easily generate multiple statistically-independent hash functions, one could take each bucket whose keys don't all match the first one and use another hash function to partition the contents of that bucket.
Assuming a constant number of buckets on each step, the time is going to be O(NlgN), but if one sets the number of buckets to something like sqrt(N), the average number of passes should be O(1) and the work in each pass O(n).

Can hash tables really be O(1)?

It seems to be common knowledge that hash tables can achieve O(1), but that has never made sense to me. Can someone please explain it? Here are two situations that come to mind:
A. The value is an int smaller than the size of the hash table. Therefore, the value is its own hash, so there is no hash table. But if there was, it would be O(1) and still be inefficient.
B. You have to calculate a hash of the value. In this situation, the order is O(n) for the size of the data being looked up. The lookup might be O(1) after you do O(n) work, but that still comes out to O(n) in my eyes.
And unless you have a perfect hash or a large hash table, there are probably several items per bucket. So, it devolves into a small linear search at some point anyway.
I think hash tables are awesome, but I do not get the O(1) designation unless it is just supposed to be theoretical.
Wikipedia's article for hash tables consistently references constant lookup time and totally ignores the cost of the hash function. Is that really a fair measure?
Edit: To summarize what I learned:
It is technically true because the hash function is not required to use all the information in the key and so could be constant time, and because a large enough table can bring collisions down to near constant time.
It is true in practice because over time it just works out as long as the hash function and table size are chosen to minimize collisions, even though that often means not using a constant time hash function.
You have two variables here, m and n, where m is the length of the input and n is the number of items in the hash.
The O(1) lookup performance claim makes at least two assumptions:
Your objects can be equality compared in O(1) time.
There will be few hash collisions.
If your objects are variable size and an equality check requires looking at all bits then performance will become O(m). The hash function however does not have to be O(m) - it can be O(1). Unlike a cryptographic hash, a hash function for use in a dictionary does not have to look at every bit in the input in order to calculate the hash. Implementations are free to look at only a fixed number of bits.
For sufficiently many items, the number of items will become greater than the number of possible hashes, and then you will get collisions, causing the performance to rise above O(1), for example to O(n) for a simple linked list traversal (or O(n*m) if both assumptions are false).
In practice though the O(1) claim while technically false, is approximately true for many real world situations, and in particular those situations where the above assumptions hold.
You have to calculate the hash, so the order is O(n) for the size of the data being looked up. The lookup might be O(1) after you do O(n) work, but that still comes out to O(n) in my eyes.
What? To hash a single element takes constant time. Why would it be anything else? If you're inserting n elements, then yes, you have to compute n hashes, and that takes linear time... to look an element up, you compute a single hash of what you're looking for, then find the appropriate bucket with that. You don't re-compute the hashes of everything that's already in the hash table.
And unless you have a perfect hash or a large hash table there are probably several items per bucket so it devolves into a small linear search at some point anyway.
Not necessarily. The buckets don't necessarily have to be lists or arrays, they can be any container type, such as a balanced BST. That means O(log n) worst case. But this is why it's important to choose a good hashing function to avoid putting too many elements into one bucket. As KennyTM pointed out, on average, you will still get O(1) time, even if occasionally you have to dig through a bucket.
The trade off of hash tables is of course the space complexity. You're trading space for time, which seems to be the usual case in computing science.
You mention using strings as keys in one of your other comments. You're concerned about the amount of time it takes to compute the hash of a string, because it consists of several chars? As someone else pointed out again, you don't necessarily need to look at all the chars to compute the hash, although it might produce a better hash if you did. In that case, if there are on average m chars in your key, and you used all of them to compute your hash, then I suppose you're right, that lookups would take O(m). If m >> n then you might have a problem. You'd probably be better off with a BST in that case. Or choose a cheaper hashing function.
The hash is fixed size - looking up the appropriate hash bucket is a fixed cost operation. This means that it is O(1).
Calculating the hash does not have to be a particularly expensive operation - we're not talking cryptographic hash functions here. But that's by the by. The hash function calculation itself does not depend on the number n of elements; while it might depend on the size of the data in an element, this is not what n refers to. So the calculation of the hash does not depend on n and is also O(1).
Hashing is O(1) only if there are only constant number of keys in the table and some other assumptions are made. But in such cases it has advantage.
If your key has an n-bit representation, your hash function can use 1, 2, ... n of these bits. Think about a hash function that uses 1 bit. Evaluation is O(1) for sure. But you are only partitioning the key space into 2, so you are mapping as many as 2^(n-1) keys into the same bin. Using BST search, it takes up to n-1 steps to locate a particular key if the bin is nearly full.
You can extend this to see that if your hash function uses K bits your bin size is 2^(n-k).
so K-bit hash function ==> no more than 2^K effective bins ==> up to 2^(n-K) n-bit keys per bin ==> (n-K) steps (BST) to resolve collisions. Actually most hash functions are much less "effective" and need/use more than K bits to produce 2^k bins. So even this is optimistic.
You can view it this way -- you will need ~n steps to be able to uniquely distinguish a pair of keys of n bits in the worst case. There is really no way to get around this information theory limit, hash table or not.
However, this is NOT how/when you use hash table!
The complexity analysis assumes that for n-bit keys, you could have O(2^n) keys in the table (e.g. 1/4 of all possible keys). But most if not all of the time we use hash table, we only have a constant number of the n-bit keys in the table. If you only want a constant number of keys in the table, say C is your maximum number, then you could form a hash table of O(C) bins, that guarantees expected constant collision (with a good hash function); and a hash function using ~logC of the n bits in the key. Then every query is O(logC) = O(1). This is how people claim "hash table access is O(1)".
There are a couple of catches here. First, saying you don't need all the bits may only be a billing trick. You cannot really pass the key value to the hash function, because that would be moving n bits in the memory, which is O(n). So you need to pass it by reference, for example. But you still need to have stored it somewhere already, which was an O(n) operation; you just don't bill it to the hashing, and your overall computation task cannot avoid it. Second, you do the hashing, find the bin, and find more than one key; your cost depends on your resolution method. If you do comparison-based resolution (BST or list), you will have an O(n) operation (recall the key is n-bit); if you do a second hash, well, you have the same issue if the second hash has a collision. So O(1) is not 100% guaranteed unless you have no collision (you can improve the chance by having a table with more bins than keys, but still).
Consider the alternative, e.g. a BST, in this case. There are C keys, so a balanced BST will be O(logC) in depth, so a search takes O(logC) steps. However, the comparison in this case would be an O(n) operation ... so it appears hashing is a better choice in this case.
TL;DR: Hash tables guarantee O(1) expected worst case time if you pick your hash function uniformly at random from a universal family of hash functions. Expected worst case is not the same as average case.
Disclaimer: I don't formally prove hash tables are O(1), for that have a look at this video from coursera [1]. I also don't discuss the amortized aspects of hash tables. That is orthogonal to the discussion about hashing and collisions.
I see a surprisingly great deal of confusion around this topic in other answers and comments, and will try to rectify some of them in this long answer.
Reasoning about worst case
There are different types of worst case analysis. The analysis that most answers have made here so far is not worst case, but rather average case [2]. Average case analysis tends to be more practical. Maybe your algorithm has one bad worst case input, but actually works well for all other possible inputs. Bottomline is your runtime depends on the dataset you're running on.
Consider the following pseudocode of the get method of a hash table. Here I'm assuming we handle collision by chaining, so each entry of the table is a linked list of (key,value) pairs. We also assume the number of buckets m is fixed but is O(n), where n is the number of elements in the input.
function get(a: Table with m buckets, k: Key being looked up)
    bucket <- compute hash(k) modulo m
    for each (key,value) in a[bucket]
        return value if k == key
    return not_found
As other answers have pointed out, this runs in average O(1) and worst case O(n). We can make a little sketch of a proof by challenge here. The challenge goes as follows:
(1) You give your hash table algorithm to an adversary.
(2) The adversary can study it and prepare as long as he wants.
(3) Finally the adversary gives you an input of size n for you to insert in your table.
The question is: how fast is your hash table on the adversary input?
From step (1) the adversary knows your hash function; during step (2) the adversary can craft a list of n elements with the same hash modulo m, by e.g. randomly computing the hash of a bunch of elements; and then in (3) they can give you that list. But lo and behold, since all n elements hash to the same bucket, your algorithm will take O(n) time to traverse the linked list in that bucket. No matter how many times we retry the challenge, the adversary always wins, and that's how bad your algorithm is, worst case O(n).
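To make the challenge concrete, here is a small chained table with one fixed hash function and an input crafted against it. This is only a sketch: it relies on the CPython behaviour that `hash(k) == k` for small non-negative integers, and the class name `ChainedTable` is made up for this example:

```python
class ChainedTable:
    """Hash table with chaining and a single, publicly known hash function."""

    def __init__(self, m):
        self.m = m
        self.buckets = [[] for _ in range(m)]

    def put(self, key, value):
        bucket = self.buckets[hash(key) % self.m]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)   # overwrite an existing key
                return
        bucket.append((key, value))

    def get(self, key):
        for k, v in self.buckets[hash(key) % self.m]:
            if k == key:
                return v
        return None


table = ChainedTable(m=8)
adversarial_keys = [8 * i for i in range(100)]   # all congruent to 0 modulo m
for k in adversarial_keys:
    table.put(k, str(k))
longest = max(len(b) for b in table.buckets)
print(longest)   # every key landed in the same chain
```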
How come hashing is O(1)?
What threw us off in the previous challenge was that the adversary knew our hash function very well, and could use that knowledge to craft the worst possible input.
What if instead of always using one fixed hash function, we actually had a set of hash functions, H, that the algorithm can randomly choose from at runtime? In case you're curious, H is called a universal family of hash functions [3]. Alright, let's try adding some randomness to this.
First suppose our hash table also includes a seed r, and r is assigned to a random number at construction time. We assign it once and then it's fixed for that hash table instance. Now let's revisit our pseudocode.
function get(a: Table with m buckets and seed r, k: Key being looked up)
    rHash <- H[r]
    bucket <- compute rHash(k) modulo m
    for each (key, value) in a[bucket]
        return value if k == key
    return not_found
If we try the challenge one more time: from step (1) the adversary can know all the hash functions we have in H, but now the specific hash function we use depends on r. The value of r is private to our structure: the adversary cannot inspect it at runtime, nor predict it ahead of time, so he can't concoct a list that's always bad for us. Let's assume that in step (2) the adversary chooses one function hash in H at random. He then crafts a list of n collisions under hash modulo m, and sends that for step (3), crossing fingers that at runtime H[r] will be the same hash he chose.
This is a serious bet for the adversary: the list he crafted collides under hash, but will just be a random input under any other hash function in H. If he wins this bet, our run time will be worst case O(n) like before, but if he loses, then we're just being given a random input, which takes the average O(1) time. And indeed most times the adversary will lose: he wins only once every |H| challenges, and we can make |H| very large.
Contrast this result to the previous algorithm where the adversary always won the challenge. Handwaving here a bit, but since most times the adversary will fail, and this is true for all possible strategies the adversary can try, it follows that although the worst case is O(n), the expected worst case is in fact O(1).
Again, this is not a formal proof. The guarantee we get from this expected worst case analysis is that our run time is now independent of any specific input. This is a truly random guarantee, as opposed to the average case analysis where we showed a motivated adversary could easily craft bad inputs.
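Here is a rough sketch of the seeded version of the challenge. As an assumption for the example, H is the classic family h(x) = ((a*x + b) mod P) mod m over integer keys, and the seed r is the pair (a, b). The names make_hash and longest_chain are made up for illustration, and the fixed RNG seed is there only so the demo is repeatable.

```python
import random

P = (1 << 61) - 1  # a Mersenne prime larger than any key used below


def make_hash(a, b, m):
    """One member of the family H: h(x) = ((a*x + b) mod P) mod m."""
    return lambda x: ((a * x + b) % P) % m


def longest_chain(keys, h, m):
    """Size of the fullest bucket when the keys are placed with h."""
    counts = [0] * m
    for k in keys:
        counts[h(k)] += 1
    return max(counts)


m = n = 256
rng = random.Random(2024)  # fixed seed only so the demo is repeatable

# Step (2): the adversary guesses one function from H and collects n keys
# that all land in bucket 0 under that guess.
guess = make_hash(rng.randrange(1, P), rng.randrange(P), m)
bad_keys = []
x = 0
while len(bad_keys) < n:
    if guess(x) == 0:
        bad_keys.append(x)
    x += 1

# Step (3): the table draws its own private seed r = (a, b).
secret = make_hash(rng.randrange(1, P), rng.randrange(P), m)

print(longest_chain(bad_keys, guess, m))   # n: what happens if the guess were right
print(longest_chain(bad_keys, secret, m))  # far below n when the guess is wrong
```

The crafted list is only bad for the one function the adversary guessed. Under the privately chosen function it behaves like ordinary input.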
TL;DR: hash() is usually O(m), where m is the length of the key
My three cents.
24 years ago, when Sun released JDK 1.2, they fixed a bug in String.hashCode(): instead of computing the hash from only some portion of the string, since JDK 1.2 it reads every single character. This change was intentional and, IMHO, very wise.
In most languages the built-in hash works similarly. It processes the whole object to compute the hash, because keys are usually small while collisions can cause serious issues.
There are a lot of theoretical arguments confirming and denying the O(1) hash lookup cost. A lot of them are reasonable and educational.
Let us skip the theory and do some experiment instead:
import timeit
samples = [tuple("LetsHaveSomeFun!")]  # the effect is easier to see with tuples
# samples = ["LetsHaveSomeFun!"]  # hashing a str is much faster; increase the sample size to see it
for _ in range(25 if isinstance(samples[0], str) else 20):
    samples.append(samples[-1] * 2)  # double the key length each round
empty = {}
for i, s in enumerate(samples):
    t = timeit.timeit(lambda: s in empty, number=2000)
    print(f"{i}. For element of length {len(s)} it took {t:0.3f} time to lookup in empty hashmap")
When I run it I get:
0. For element of length 16 it took 0.000 time to lookup in empty hashmap
1. For element of length 32 it took 0.000 time to lookup in empty hashmap
2. For element of length 64 it took 0.001 time to lookup in empty hashmap
3. For element of length 128 it took 0.001 time to lookup in empty hashmap
4. For element of length 256 it took 0.002 time to lookup in empty hashmap
5. For element of length 512 it took 0.003 time to lookup in empty hashmap
6. For element of length 1024 it took 0.006 time to lookup in empty hashmap
7. For element of length 2048 it took 0.012 time to lookup in empty hashmap
8. For element of length 4096 it took 0.025 time to lookup in empty hashmap
9. For element of length 8192 it took 0.048 time to lookup in empty hashmap
10. For element of length 16384 it took 0.094 time to lookup in empty hashmap
11. For element of length 32768 it took 0.184 time to lookup in empty hashmap
12. For element of length 65536 it took 0.368 time to lookup in empty hashmap
13. For element of length 131072 it took 0.743 time to lookup in empty hashmap
14. For element of length 262144 it took 1.490 time to lookup in empty hashmap
15. For element of length 524288 it took 2.900 time to lookup in empty hashmap
16. For element of length 1048576 it took 5.872 time to lookup in empty hashmap
17. For element of length 2097152 it took 12.003 time to lookup in empty hashmap
18. For element of length 4194304 it took 25.176 time to lookup in empty hashmap
19. For element of length 8388608 it took 50.399 time to lookup in empty hashmap
20. For element of length 16777216 it took 99.281 time to lookup in empty hashmap
Clearly the hash is O(m) where m is the length of a key.
You can run similar experiments in other mainstream languages, and I expect you will get similar results.
It seems based on discussion here, that if X is the ceiling of (# of elements in table/# of bins), then a better answer is O(log(X)) assuming an efficient implementation of bin lookup.
There are two settings under which you can get O(1) worst-case times.
If your setup is static, then FKS hashing will get you worst-case O(1) guarantees. But as you indicated, your setting isn't static.
If you use Cuckoo hashing, then queries and deletes are O(1) worst-case, but insertion is only O(1) expected. Cuckoo hashing works quite well if you have an upper bound on the total number of inserts, and set the table size to be roughly 25% larger.
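For a feel of why cuckoo lookups are worst-case O(1), here is a minimal sketch of the two-table variant. CuckooSet and its methods are hypothetical names, and two choices are assumptions of this sketch rather than part of the answer above: the two hash functions are built by salting Python's built-in hash, and the table simply doubles and reseeds when an insertion fails. None is used as the empty-slot marker, so it cannot be stored as a key.

```python
import random


class CuckooSet:
    """Set of hashable keys; each key lives in one of two candidate slots."""

    def __init__(self, capacity=16):
        self.capacity = capacity
        self._reseed()
        self.slots = [[None] * capacity, [None] * capacity]

    def _reseed(self):
        # Two independent salts stand in for two independent hash functions.
        self.seeds = (random.getrandbits(64), random.getrandbits(64))

    def _index(self, which, key):
        return hash((self.seeds[which], key)) % self.capacity

    def __contains__(self, key):
        # At most two probes, no matter how many keys are stored.
        return any(self.slots[w][self._index(w, key)] == key for w in (0, 1))

    def discard(self, key):
        # Also at most two probes.
        for w in (0, 1):
            i = self._index(w, key)
            if self.slots[w][i] == key:
                self.slots[w][i] = None

    def add(self, key):
        if key in self:
            return
        which = 0
        for _ in range(2 * self.capacity):  # bound on the eviction chain
            i = self._index(which, key)
            # Place the key and pick up whatever was in the slot before.
            key, self.slots[which][i] = self.slots[which][i], key
            if key is None:
                return
            which = 1 - which  # the evicted key goes to its other table
        self._rebuild(extra=key)  # give up: grow, reseed, reinsert

    def _rebuild(self, extra):
        keys = [k for table in self.slots for k in table if k is not None]
        keys.append(extra)
        self.capacity *= 2
        self._reseed()
        self.slots = [[None] * self.capacity, [None] * self.capacity]
        for k in keys:
            self.add(k)
```

Lookups and deletes never touch more than two slots. All the uncertainty sits in add, which may bounce keys around or rebuild, and that is why insertion is only expected O(1).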
Copied from here
A. The value is an int smaller than the size of the hash table. Therefore, the value is its own hash, so there is no hash table. But if there was, it would be O(1) and still be inefficient.
This is a case where you could trivially map the keys to distinct buckets, so an array seems a better choice of data structure than a hash table. Still, the inefficiencies don't grow with the table size.
(You might still use a hash table because you don't trust the ints to remain smaller than the table size as the program evolves, you want to make the code potentially reusable when that relationship doesn't hold, or you just don't want people reading/maintaining the code to have to waste mental effort understanding and maintaining the relationship).
B. You have to calculate a hash of the value. In this situation, the order is O(n) for the size of the data being looked up. The lookup might be O(1) after you do O(n) work, but that still comes out to O(n) in my eyes.
We need to distinguish between the size of the key (e.g. in bytes), and the size of the number of keys being stored in the hash table. Claims that hash tables provide O(1) operations mean that operations (insert/erase/find) don't tend to slow down further as the number of keys increases from hundreds to thousands to millions to billions (at least not if all the data is accessed/updated in equally fast storage, be that RAM or disk - cache effects may come into play but even the cost of a worst-case cache miss tends to be some constant multiple of best-case hit).
Consider a telephone book: you may have names in there that are quite long, but whether the book has 100 names, or 10 million, the average name length is going to be pretty consistent, and the worst case in history...
Guinness world record for the Longest name used by anyone ever was set by Adolph Blaine Charles David Earl Frederick Gerald Hubert Irvin John Kenneth Lloyd Martin Nero Oliver Paul Quincy Randolph Sherman Thomas Uncas Victor William Xerxes Yancy Wolfeschlegelsteinhausenbergerdorff, Senior
...wc tells me that's 215 characters - that's not a hard upper-bound to the key length, but we don't need to worry about there being massively more.
That holds for most real world hash tables: the average key length doesn't tend to grow with the number of keys in use. There are exceptions, for example a key creation routine might return strings embedding incrementing integers, but even then every time you increase the number of keys by an order of magnitude you only increase the key length by 1 character: it's not significant.
It's also possible to create a hash from a fixed-size amount of key data. For example, Microsoft's Visual C++ ships with a Standard Library implementation of std::hash<std::string> that creates a hash incorporating just ten bytes evenly spaced along the string, so if the strings only vary at other indices you get collisions (and hence in practice non O(1) behaviours on the post-collision searching side), but the time to create the hash has a hard upper bound.
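As a rough illustration of that trade-off, here is a hypothetical sampled_hash that looks at no more than ten evenly spaced characters and mixes them FNV-1a style. It is not the actual Visual C++ algorithm, only a sketch of the idea of hashing a bounded amount of key data.

```python
def sampled_hash(s, samples=10):
    """Hash at most `samples` characters spaced evenly along s."""
    n = len(s)
    step = max(1, n // samples)
    h = 0xcbf29ce484222325  # FNV-1a 64-bit offset basis
    for i in range(0, n, step)[:samples]:
        h ^= ord(s[i])
        h = (h * 0x100000001b3) % (1 << 64)  # FNV 64-bit prime, wrap to 64 bits
    return h


a = "x" * 1000
b = "x" * 501 + "y" + "x" * 498  # differs from a only at index 501 (not sampled)
c = "y" + "x" * 999              # differs from a at index 0 (sampled)

print(sampled_hash(a) == sampled_hash(b))  # True: the difference is never read
print(sampled_hash(a) == sampled_hash(c))  # False
```

The work per key is bounded no matter how long the string is, and the price is guaranteed collisions between keys that differ only at unsampled positions.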
And unless you have a perfect hash or a large hash table, there are probably several items per bucket. So, it devolves into a small linear search at some point anyway.
Generally true, but the awesome thing about hash tables is that the number of keys visited during those "small linear searches" is - for the separate chaining approach to collisions - a function of the hash table load factor (ratio of keys to buckets).
For example, with a load factor of 1.0 there's an average of ~1.58 to the length of those linear searches, regardless of the number of keys (see my answer here). For closed hashing it's a bit more complicated, but not much worse when the load factor isn't too high.
It is technically true because the hash function is not required to use all the information in the key and so could be constant time, and because a large enough table can bring collisions down to near constant time.
This kind of misses the point. Any kind of associative data structure ultimately has to do operations across every part of the key sometimes (inequality may sometimes be determined from just a part of the key, but equality generally requires every bit be considered). At a minimum, it can hash the key once and store the hash value, and if it uses a strong enough hash function - e.g. 64-bit MD5 - it might practically ignore even the possibility of two keys hashing to the same value (a company I worked for did exactly that for the distributed database: hash-generation time was still insignificant compared to WAN-wide network transmissions). So, there's not too much point obsessing about the cost to process the key: that's inherent in storing keys regardless of the data structure, and as said above - doesn't tend to grow worse on average with there being more keys.
As for large enough hash tables bringing collisions down, that's missing the point too. For separate chaining, you still have a constant average collision chain length at any given load factor - it's just higher when the load factor is higher, and that relationship is non-linear. The SO user Hans comments on my answer also linked above that:
average bucket length conditioned on nonempty buckets is a better measure of efficiency. It is a/(1-e^{-a}) [where a is the load factor, e is 2.71828...]
So, the load factor alone determines the average number of colliding keys you have to search through during insert/erase/find operations. For separate chaining, it doesn't just approach being constant when the load factor is low - it's always constant. For open addressing though your claim has some validity: some colliding elements are redirected to alternative buckets and can then interfere with operations on other keys, so at higher load factors (especially > .8 or .9) collision chain length gets more dramatically worse.
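A quick way to check the quoted formula is to compare it with a simulation. The sketch below assumes a hash that spreads keys uniformly at random over the buckets, and the function names are made up for the example.

```python
import math
import random


def expected_nonempty_chain(a):
    """Average chain length over non-empty buckets at load factor a."""
    return a / (1 - math.exp(-a))


def simulated_nonempty_chain(n_keys, n_buckets, rng):
    """Throw n_keys uniformly into n_buckets and measure the same average."""
    counts = [0] * n_buckets
    for _ in range(n_keys):
        counts[rng.randrange(n_buckets)] += 1
    nonempty = sum(1 for c in counts if c)
    return n_keys / nonempty


rng = random.Random(0)
for n_buckets in (1_000, 100_000):
    # Same number of keys as buckets, i.e. load factor 1.0 at both sizes.
    sim = simulated_nonempty_chain(n_buckets, n_buckets, rng)
    print(n_buckets, round(sim, 2), round(expected_nonempty_chain(1.0), 2))
```

The formula gives about 1.58 at load factor 1.0, and the simulated average should stay close to that whether the table holds a thousand keys or a hundred thousand.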
It is true in practice because over time it just works out as long as the hash function and table size are chosen to minimize collisions, even though that often means not using a constant time hash function.
Well, the table size should result in a sane load factor given the choice of closed hashing or separate chaining, but also if the hash function is a bit weak and the keys aren't very random, having a prime number of buckets often helps reduce collisions too (hash-value % table-size then wraps around such that changes only to a high order bit or two in the hash-value still resolve to buckets spread pseudo-randomly across different parts of the hash table).
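A small sketch of the prime bucket count effect, using made-up hash values in which only the bits above the low six ever change (think of a weak hash over aligned addresses). The helper max_chain is hypothetical.

```python
from collections import Counter


def max_chain(hash_values, n_buckets):
    """Length of the fullest bucket when bucket = hash_value % n_buckets."""
    return max(Counter(h % n_buckets for h in hash_values).values())


# Hypothetical weak hash output: the low six bits are always zero.
hash_values = [i << 6 for i in range(610)]

print(max_chain(hash_values, 64))  # 610: modulo 64 keeps only the low six bits
print(max_chain(hash_values, 61))  # 10: a prime modulus mixes the high bits in
```

With 64 buckets every value collapses into bucket 0, while with 61 buckets the same values spread evenly at ten per bucket.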
Leaving other considerations aside, the O(1) claim hinges on a constant time access model of memory, which is a good enough approximation for most practical computer science but not strictly justifiable from a theoretical point of view.
For starters, any memory addressing scheme necessarily requires multiplexing at the circuit level, which in turn requires a circuit depth at least proportional to O(log N). Since clock frequency is inversely proportional to the longest path (in number of traversed gates) of a circuit, this implies no general memory access scheme can run in less than O(log N) for fast enough CPUs or large enough memories.
Then, at a more fundamental level, you can only pack so many bits of memory within a finite distance D of the processor: at most on the order of D^3. Given the finite speed of light, this means your worst case time for a random access into N bits of memory is at least proportional to N^(1/3), and more likely N^(1/2) if we take into account that integrated circuits are two-dimensional.
But of course in practice computers operate far from reaching these limits... or do they? This is when cache hierarchies enter the game, and why no good implementation of an algorithm or data structure can afford to ignore the actual details of the use case or the hardware implementation.
Either way the absolute worst case for a random memory access timing is given by the ping latency between your computer and some server at the opposite side of the planet, which can be in the 100s of ms and is, for the record, a lot worse than the best case scenario of having the data cached in L1 or -even better- already loaded in the registers.
As for the cost of hashing, you are correct in that it cannot be truly constant or even bounded by a set number of operations when applied to a potentially unbounded set of arbitrary-size keys such as strings, which can only be dealt with efficiently for the randomized case, but often do share arbitrarily long common prefixes that require reading and processing a number of bits larger than the size of the prefix.
For such cases it may be advisable to use a specialized data structure such as a z-fast trie or similar, which can simultaneously disambiguate prefixes and perform random memory access in amortized O(lg lg lg N).

Resources