Hash Collision Linear Probing Running Time - algorithm

I am trying to do homework with a friend, and one question asks for the average running time of search, add, and delete for the linear probing method. I think it's O(n), because it has to check a certain number of slots until it finds an open one to add, and when searching it starts at the original index and moves up until it finds the desired number. But my friend says it's O(1). Which one is right?

When we talk about asymptotic complexities we generally take into account a very large n. For collision handling in a hash table, two common methods are chained hashing and linear probing. In both cases two things may happen (and they will help in answering your question): 1. you may need to resize the hash table because it gets full, and 2. collisions may happen.
The worst case depends on how you have implemented your hash table. Say with linear probing you don't find the number right away: you keep moving along, and the number you were looking for is at the very end. That is where the O(n) worst-case running time comes from. With the chained hashing technique, if collisions are handled by storing the colliding keys in a balanced binary tree, the worst-case running time would be O(log n).
As for the best-case running time, I think there is no confusion: in either case it would be O(1).
O(n) happens in the worst case, not in the average case of a well-designed hash table. If that started happening in the average case, hash tables wouldn't find a place among data structures, because balanced trees would then give you O(log n) on average and, on top of that, preserve order too.
Sorry to say this, but unfortunately your friend is right. Your scenario describes the worst case, not the average case.
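To make the probing concrete, here is a minimal Java sketch of open addressing with linear probing (the class, the indexOf/insert names, and the tiny fixed-size table are only illustrative, not a production hash table). The search walks forward from the hashed slot until it finds the key or hits an empty slot: a handful of probes on average, but up to n probes if it has to walk a long run of occupied slots.

// Illustrative linear-probing table of Integer keys (null = empty slot).
public class LinearProbingSketch {
    static boolean insert(Integer[] table, int key) {
        int m = table.length;
        int i = Math.floorMod(Integer.hashCode(key), m);  // start at the hashed slot
        for (int probes = 0; probes < m; probes++) {
            if (table[i] == null || table[i] == key) { table[i] = key; return true; }
            i = (i + 1) % m;                              // next slot (linear probing)
        }
        return false;                                     // table is full
    }

    static int indexOf(Integer[] table, int key) {
        int m = table.length;
        int i = Math.floorMod(Integer.hashCode(key), m);
        for (int probes = 0; probes < m; probes++) {
            if (table[i] == null) return -1;              // empty slot: key is absent
            if (table[i] == key) return i;                // found it
            i = (i + 1) % m;
        }
        return -1;                                        // walked the whole table: the O(n) case
    }

    public static void main(String[] args) {
        Integer[] table = new Integer[8];
        for (int k : new int[]{5, 13, 21, 7}) insert(table, k); // 5, 13, 21 all hash to slot 5
        System.out.println(indexOf(table, 21));  // found after a short probe run
        System.out.println(indexOf(table, 29));  // -1: probes until it hits an empty slot
    }
}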
Also look here for more informative stuff i.e. the amortized running time: Time complexity of Hash table

Related

determining if an integer exists in a set in O(1) expected time and O(logn) worst case

I need to describe a data structure that is able to determine whether or not a particular integer exists in a set in O(1) expected time and O(log n) worst case, while also consuming O(n) space. I've had a look at a table of common data structures and their big-O time/space complexities, but I can't seem to find any that fit these requirements. Is there a way to modify a BST to fit these requirements?
As @Carcigenicate commented, a hashmap exhibits the behavior you want. It has O(1) expected lookup time, except in the case of collisions. On a collision, hashmaps typically keep a list of the items in a given bucket, so in the worst-case scenario a hashmap behaves like a list. That would imply a worst-case search time of O(n), which does not fit your requirement.
Java 8's HashMap, however, uses balanced trees instead of lists to store items that collide in the same bucket. This guarantees a worst-case search time of O(log n).
So, to solve your problem, you would need to modify the implementation of a hashmap to use trees for collisions. If you are using Java 8, then life is already good and you can just run with HashMap.
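For example, in Java 8+ you can get this behaviour from the built-in classes (the demo class name below is made up, but HashSet/HashMap are the real APIs): expected O(1) membership tests, O(n) space, and, because heavily colliding buckets are converted to balanced trees, O(log n) worst-case lookups for Comparable keys such as Integer.

import java.util.HashSet;
import java.util.Set;

// Integer membership with expected O(1) contains(); treeified buckets bound the worst case by O(log n).
public class IntSetDemo {
    public static void main(String[] args) {
        Set<Integer> set = new HashSet<>();          // O(n) space for n elements
        for (int i = 0; i < 1_000; i++) set.add(i * 7);
        System.out.println(set.contains(49));        // true, expected O(1)
        System.out.println(set.contains(50));        // false
    }
}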

Complexity of maintaining a sorted list vs inserting all values then sorting

Would the time and space complexity of maintaining a list of numbers in sorted order (i.e. start with the first one and insert it; when the 2nd one comes along, insert it in sorted order; and so on) be the same as inserting them as they appear and then sorting after all insertions have been made?
How do I make this decision? Can you demonstrate in terms of time and space complexity for 'n' elements?
I was thinking in terms of a phonebook: what is the difference between storing the records in a set and presenting sorted data to the user each time he inserts a record, versus storing the phonebook records in sorted order in a TreeSet? What would it be for n elements?
Every time you insert into a sorted list and maintain its sortedness, it is O(log n) comparisons to find where to place it but O(n) movements to place it. Since we insert n elements this is O(n^2). But, I think that if you use a data structure that is designed for inserting sorted data into (such as a binary tree) then do a pass at the end to turn it into a list/array, it is only O(n log n). On the other hand, using such a more complex data structure will use about O(n) additional space, whereas all other approaches can be done in-place and use no additional space.
Every time you insert into an unsorted list it is O(1). Sorting it all at the end is O(n log n). This means overall it is O(n log n).
However, if you are not going to make lists of many elements (1000 or less) it probably doesn't matter what big-O it is, and you should either focus on what runs faster for small data sets, or not worry at all if it is not a performance issue.
It depends on what data structure you are inserting them into. If you are asking about inserting into an array, the answer is no: it takes O(n) space and time to store the n elements and then O(n log n) to sort them, so O(n log n) total, whereas inserting one at a time into a sorted array may require you to move Ω(n) elements per insertion and so takes Θ(n²) overall. The same problem will be true of most "sequential" data structures. Sorry.
On the other hand, some priority queues, such as lazy leftist heaps, Fibonacci heaps, and Brodal queues, have O(1) insert, while a finger tree lets you do all n sorted insertions in O(n log n) total AND still gives you list-like access (finger trees are as good as a linked list for what a linked list is good for and as good as balanced binary search trees for what binary search trees are good for; they are kind of amazing).
There are going to be application-specific trade-offs to algorithm selection. The reasons one might use an insertion sort rather than some kind of offline sorting algorithm are enumerated on the Insertion Sort wikipedia page.
The determining factor here is less likely to be asymptotic complexity and more likely to be what you know about your data (e.g., is it likely to be already sorted?)
I'd go further, but I'm not convinced that this isn't a homework question asked verbatim.
Option 1
Insert at correct position in sorted order.
Time taken to find the position of the (i+1)-th element: O(log i)
Time taken to insert it and shift elements to maintain order: O(i)
Space complexity: O(N)
Total time: (log 1 + log 2 + ... + log(N-1)) = O(N log N) comparisons, plus (1 + 2 + ... + (N-1)) = O(N²) element moves if the list is stored in an array.
Understand that this is just the time complexity. The running time can be very different from this.
Option 2:
Insert each element as it arrives: O(1)
Sort all the elements at the end: O(N log N)
Depending on the sort you employ, the space complexity varies, though you can use something like quicksort, which doesn't need much extra space anyway.
In conclusion, both options need O(N log N) comparisons, though the bounds are weak and mathematically you can come up with tighter ones. Also note that worst-case complexity may never be encountered in practical situations; you will probably see only average cases all the time. If performance is such a vital issue in your application, you should test both sets of code on random samples. Do tell me which one works faster after your tests. My guess is option 1.
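If you do want to test it, here is a rough sketch (the class and method names are just illustrative) that runs both options on the same random input using an ArrayList; timings will of course vary by machine and by n.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Option 1: keep the list sorted on every insert (binary search + O(i) shift).
// Option 2: append everything, then sort once at the end.
public class SortedInsertVsSortAtEnd {
    static List<Integer> option1(int[] values) {
        List<Integer> list = new ArrayList<>();
        for (int v : values) {
            int pos = Collections.binarySearch(list, v); // O(log i) comparisons
            if (pos < 0) pos = -pos - 1;                 // insertion point when not found
            list.add(pos, v);                            // O(i) element moves
        }
        return list;
    }

    static List<Integer> option2(int[] values) {
        List<Integer> list = new ArrayList<>();
        for (int v : values) list.add(v);                // O(1) amortized append
        Collections.sort(list);                          // one O(N log N) sort
        return list;
    }

    public static void main(String[] args) {
        int[] values = new Random(42).ints(50_000).toArray();
        long t0 = System.nanoTime(); option1(values);
        long t1 = System.nanoTime(); option2(values);
        long t2 = System.nanoTime();
        System.out.printf("option 1: %d ms, option 2: %d ms%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
    }
}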

Amortized performance of LinkedList vs HashMap

The amortised performance of Hash tables is often said to be O(1) for most operations.
What is the amortized performance for a search operation on say a standard LinkedList implementation? Is it O(n)?
I'm a little confused on how this is computed, since in the worst-case (assuming say a hash function that always collides), a Hash table is pretty much equivalent to a LinkedList in terms of say a search operation (assuming a standard bucket implementation).
I know in practice this would never happen unless the hash function was broken, and so the average performance is almost constant time over a series of operations since collisions are rare. But when calculating amortized worst-case performance, shouldn't we consider the worst-case sequence with the worst-case implementation?
There is no such thing as "amortized worst-case performance". Amortized performance is a kind of "average" performance over a long sequence of operations.
With a hash table, sometimes the hash table will need to be resized after a long sequence of inserts, which will take O(n) time. But, since it only happens every O(n) inserts, that operation's cost is spread out over all the inserts to get O(1) amortized time.
Yes, a hash table could be O(n) for every operation in the worst case of a broken hash function. But, analyzing such a hash table is meaningless because it won't be the case for typical usage.
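To see where the amortized O(1) for inserts comes from, here is a small illustrative counter (not a real hash table, just the doubling arithmetic): it tallies how many element copies n inserts trigger when the table doubles every time it fills up.

// Count the total copying work caused by table doubling over n inserts.
// The copies form the geometric series 1 + 2 + 4 + ... < 2n, hence O(1) amortized per insert.
public class DoublingCost {
    public static void main(String[] args) {
        int n = 1_000_000;
        int capacity = 1;
        int size = 0;
        long copies = 0;
        for (int i = 0; i < n; i++) {
            if (size == capacity) {   // table full: resize and rehash everything
                copies += size;       // rehashing moves every stored element once
                capacity *= 2;
            }
            size++;                   // the insert itself is O(1)
        }
        System.out.println("inserts = " + n + ", copies from resizing = " + copies);
    }
}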
"Worst case" sometimes depends on "worst case under what constraints".
The case of a hashtable with a valid but stupid hash function mapping all keys to 0 generally isn't a meaningful "worst case", it's not sufficiently interesting. So you can analyse a hashtable's average performance under the minimal assumption that (for practical purposes) the hash function distributes the set of all keys uniformly across the set of all hash values.
If the hash function is reasonably sound but not cryptographically secure there's a separate "worst case" to consider. A malicious or unwitting user could systematically provide data whose hashes collide. You'd come up with a different answer for the "worst case input" vs the "worst case assuming input with well-distributed hashes".
In a given sequence of insertions into a hashtable, one of them might provoke a rehash, and you would consider that one the "worst case" in that particular sequence. This has very little to do with the input data overall: if the load factor gets high enough you're going to rehash eventually, but rarely. That's why the "amortised" running time is an interesting measure whenever you can put a tighter upper bound on the total cost of n operations than just n times the tightest upper bound on one operation.
Even if the hash function is cryptographically secure, there is a negligible probability that you could get input whose hashes all collide. This is where there's a difference between "averaging over all possible inputs" and "averaging over a sequence of operations with worst-case input". So the word "amortised" also comes with small print. In my experience it normally means the average over a series of operations, and the issue of whether the data is a good or a bad case is not part of the amortisation. nneonneo says that "there's no such thing as amortized worst-case performance", but in my experience there certainly is such a thing as worst-case amortised performance. So it's worth being precise, since this might reflect a difference in what we each expect the term to mean.
When hashtables come up with O(1) amortized insertion, they mean that n insertions take O(n) time, either (a) assuming that nothing pathologically bad happens with the hash function, or (b) in expectation for n insertions assuming random input. Because you get the same answer for hashtables either way, it's tempting to be lazy about saying which one you're talking about.

Hash Table v/s Trees

Are hashtables always faster than trees? Hash tables have O(1) average search complexity, but suppose a badly designed hash function causes a lot of collisions and we handle collisions using a chained structure (say a balanced tree); then the worst-case running time for search would be O(log n). So can I conclude that, for big or small data sets, hash tables will always be faster than trees even in worst-case scenarios? Also, if I have ample memory and I don't need range searches, can I always go for a hash table?
Are hashtables always faster than trees?
No, not always. It depends on many things, such as the size of the collection, the hash function, and, for some hash table implementations, the number of delete operations.
Hash tables are O(1) per operation on average, but this is not always the case; they might be O(n) in the worst case.
Some reasons I can think of at the moment to prefer trees:
Ordering is important. [Hash tables do not maintain order; a BST is sorted by definition.]
Latency is an issue, and you cannot afford the O(n) that might occur. [This might be critical for real-time systems.]
The data might be "similar" with respect to your hash function, so many elements hashing to the same locations [collisions] is not improbable. [This can sometimes be solved by using a different hash function.]
For relatively small collections, the hidden constant behind the hash table's O(1) is often much higher than the tree's, and using a tree might be faster for small collections.
However, if the data is huge, latency is not an issue, and collisions are improbable, hash tables are asymptotically better than using a tree.
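As a concrete illustration of that trade-off (the demo class name is made up; HashMap and TreeMap are the real java.util classes): a HashMap gives expected O(1) unordered lookups, while a TreeMap gives O(log n) lookups but keeps keys sorted and supports range queries.

import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class MapChoiceDemo {
    public static void main(String[] args) {
        Map<Integer, String> hash = new HashMap<>();     // expected O(1) get/put, no ordering
        TreeMap<Integer, String> tree = new TreeMap<>(); // O(log n) get/put, keys kept sorted
        for (int k : new int[]{42, 7, 19, 3}) {
            hash.put(k, "v" + k);
            tree.put(k, "v" + k);
        }
        System.out.println(hash.get(19));                 // expected O(1)
        System.out.println(tree.get(19));                 // O(log n)
        System.out.println(tree.subMap(5, 20).keySet());  // range query: [7, 19]
        System.out.println(tree.firstKey());              // ordering: 3
    }
}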
If a badly designed hash function causes a lot of collisions and collisions are handled by chaining with linked lists (the usual implementation), the worst-case running time for search is O(n); with balanced-tree buckets it is O(log n). Either way, you cannot conclude that hash tables will always be faster than trees for big or small data sets, even in worst-case scenarios.
Use a hash table and initialize it with the proper capacity. For example, if you keep it only half full, collisions will be very few.
In the worst-case scenario you'll have O(n) time in hash tables. But this is billions of times less probable than the sun exploding right now, so when using a good hash function you can safely assume it works in O(1) unless the sun explodes.
On the other hand, performance of both Hash-Tables and Trees can vary on implementation, language, and phase of the moon, so the only good answer to this question is "Try both, think and pick better".

Run time to insert n elements into an empty hash table

People say it takes amortized O(1) to put into a hash table. Therefore, putting n elements must be O(n). That's not true for large n, however, since as an answerer said, "All you need to satisfy expected amortized O(1) is to expand the table and rehash everything with a new random hash function any time there is a collision."
So: what is the average running-time of inserting n elements into a hash table? I realize this is probably implementation-dependent, so mention what type of implementation you're talking about.
For example, if there are (log n) equally spaced collisions, and each collision takes O(k) to resolve, where k is the current size of the hashtable, then you'd have this recurrence relation:
T(n) = T(n/2) + n/2 + n/2
(that is, you take the time to insert n/2 elements, then you have a collision, taking n/2 to resolve, then you do the remaining n/2 inserts without a collision). This still ends up being O(n), so yay. But is this reasonable?
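For what it's worth, expanding that recurrence (taking the stated n/2 + n/2 work at each halving) gives a geometric series:

T(n) = T(n/2) + n/2 + n/2 = T(n/2) + n
     = n + n/2 + n/4 + ... + O(1)
     <= 2n
     = O(n)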
It completely depends on how inefficient your rehashing is. Specifically, if you can properly estimate the expected size of your hashtable the second time, your runtime still approaches O(n). Effectively, you have to specify how inefficient your rehash size calculation is before you can determine the expected order.
People say it takes amortized O(1) to put into a hash table.
From a theoretical standpoint, it is expected amortized O(1).
Hash tables are fundamentally a randomized data structure, in the same sense that quicksort is a randomized algorithm. You need to generate your hash functions with some randomness, or else there exist pathological inputs which are not O(1).
You can achieve expected amortized O(1) using dynamic perfect hashing:
The naive idea I originally posted was to rehash with a new random hash function on every collision. (See also perfect hash functions.) The problem with this is that it requires O(n²) space, by the birthday paradox.
The solution is to have two hash tables, with the second table for collisions; resolve collisions on that second table by rebuilding it. That table will have O(√n) elements, so it would grow to O(n) size.
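To spell out the space arithmetic, under the usual assumption that the collision table is sized quadratically in its element count so that a random hash function is collision-free with constant probability (the birthday bound): if it holds m = O(√n) keys, its size is Θ(m²) = Θ((√n)²) = Θ(n), so the total space stays O(n).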
In practice you often just use a fixed hash function because you can assume (or don't care if) your input is pathological, much like you often quicksort without prerandomizing the input.
All O(1) is saying is that the operation is performed in constant time, and it's not dependent on the number of elements in your data structure.
In simple words, this means that you'll have to pay the same cost no matter how big your data structure is.
In practical terms this means that simple data structures such as trees are generally more effective when you don't have to store a lot of data. In my experience I find trees faster up to ~1k elements (32-bit integers); then hash tables take over. But as usual, YMMV.
Why not just run a few tests on your system? Maybe if you'll post the source, we can go back and test them on our systems and we could really shape this into a very useful discussion.
It is not just the implementation but the environment as well that decides how much time the algorithm actually takes. You can, however, check whether any benchmarking samples are available. The problem is that posting my results would be of no use, since people have no idea what else is running on my system, how much RAM is free right now, and so on. You can only ever get a broad idea, and that is about as good as what the big-O gives you.
