My fellow students and I have been debating for a while what the big-O notation for this is:
Creating a hash table with values by iterative insertion (the number of elements is known at the beginning), in the average and worst case.
The average complexity of inserting one element is O(1), so inserting n elements into an empty hash table should be O(n).
The worst-case complexity of inserting one element is O(n).
So is inserting n elements into an empty hash table O(n^2) or O(n), and why?
The worst case happens when every insertion results in a collision. The cost of a collision depends on the hash table implementation. The simplest implementation usually keeps a linked list of all elements that hash to the same cell, so inserting n elements costs 1 + 2 + 3 + ... + n time units. This is the sum of an arithmetic series and equals n(n+1)/2 = O(n^2). The result can be improved by using a more advanced data structure to handle collisions. For example, with an AVL tree per bucket the cost of one insertion is O(log n), so for n elements it is O(log 1 + log 2 + ... + log n) = O(log(n!)) = O(n log n), which is significantly better than O(n^2).
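As a minimal sketch of that worst case (Python, with a deliberately degenerate hash function assumed purely for illustration): every key lands in the same bucket, so the total number of comparisons grows quadratically.

```python
# A minimal sketch (not production code): a chained hash table whose hash
# function sends every key to the same bucket, to illustrate the worst case
# where n insertions perform a quadratic number of comparisons in total.

class DegenerateChainedHashTable:
    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]
        self.comparisons = 0  # total key comparisons performed so far

    def _hash(self, key):
        return 0  # adversarial hash: every key collides

    def insert(self, key, value):
        bucket = self.buckets[self._hash(key)]
        for i, (k, _) in enumerate(bucket):   # scan the chain for the key
            self.comparisons += 1
            if k == key:
                bucket[i] = (key, value)      # overwrite existing key
                return
        bucket.append((key, value))           # not found: append to the chain

if __name__ == "__main__":
    n = 100
    table = DegenerateChainedHashTable()
    for i in range(n):
        table.insert(i, str(i))
    # 0 + 1 + ... + (n-1) = n(n-1)/2 comparisons, i.e. Theta(n^2) overall
    print(table.comparisons, n * (n - 1) // 2)
```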
Big-O: in order to add an element to a HashSet the complexity is O(1), so how does a HashSet determine whether the element to be added is unique or not?
This depends on the choice of hash table, but in most traditional implementations of hash tables (linear probing, chained hashing, quadratic probing, etc.) the cost of doing an insertion is expected O(1) rather than worst-case O(1). The O(1) term here comes from the fact that on average only a constant number of elements in the hash table need to be looked at in the course of inserting a new value. In linear probing, this happens because the expected length of the run of elements to check is O(1); in chained hashing, it's because the bucket in question holds only O(1) elements in expectation.

The hash table can then check whether the newly-inserted element is equal to any of these values while performing the insertion. If so, we know it's a duplicate. If not, then we know the new element can't be a duplicate: we've checked every element that could possibly be equal to it. That means that on average the cost of an insertion is O(1) and we can check for uniqueness at the time of insertion.
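As a rough sketch (Python, chained hashing; this is not how java.util.HashSet is actually coded, just the general technique): add() scans only the one bucket the key hashes to, so it detects duplicates as a by-product of the insertion itself.

```python
# A chained hash set sketch: add() compares the new element only against the
# expected O(1) elements in its own bucket, discovering duplicates for free.

class ChainedHashSet:
    def __init__(self, num_buckets=16):
        self.buckets = [[] for _ in range(num_buckets)]
        self.size = 0

    def add(self, element):
        bucket = self.buckets[hash(element) % len(self.buckets)]
        for existing in bucket:        # expected O(1) elements per bucket
            if existing == element:
                return False           # duplicate found during insertion
        bucket.append(element)         # unique: insert it
        self.size += 1
        return True

if __name__ == "__main__":
    s = ChainedHashSet()
    print(s.add("apple"), s.add("banana"), s.add("apple"))  # True True False
```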
Is there any data structure available that would provide O(1) -- i.e. constant -- insertion complexity and O(log(n)) search complexity even in the worst case?
A sorted vector can do an O(log(n)) search, but insertion would take O(n) (given that I am not always inserting the elements at the front or the back). A list, on the other hand, would do O(1) insertion but falls short of providing O(log(n)) lookup.
I wonder whether such a data structure can even be implemented.
Yes, but you would have to bend the rules a bit in two ways:
1) You could use a structure that has O(1) insertion and O(1) search (such as the CritBit tree, also called a bitwise trie) and add an artificial cost to turn the search into O(log n).
A critbit tree is like a binary radix tree over bits. It stores keys by walking along the bits of a key (say, 32 bits) and using each bit to decide whether to navigate left ('0') or right ('1') at every node. The maximum complexity for both search and insertion is O(32), which becomes O(1).
2) I'm not sure that this is O(1) in a strict theoretical sense, because O(1) works only if we limit the value range (to, say, 32 bit or 64 bit), but for practical purposes, this seems a reasonable limitation.
Note that the perceived performance will be O(log n) until a significant fraction of the possible keys has been inserted. For example, for 16-bit keys you would probably have to insert a significant fraction of the 2^16 = 65536 possible keys.
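For illustration, here is a minimal fixed-width bitwise trie sketch in Python (a naive fixed-depth variant, not a path-compressed critbit tree): both insert and search touch at most 32 nodes, independent of how many keys are stored.

```python
# A fixed-width bitwise trie: each level consumes one bit of the key,
# so every operation visits at most KEY_BITS nodes, i.e. O(32) = O(1).

KEY_BITS = 32

class BitwiseTrie:
    def __init__(self):
        self.root = {}  # each node is a dict: {0: child, 1: child, 'value': ...}

    def insert(self, key, value):
        node = self.root
        for i in range(KEY_BITS - 1, -1, -1):      # walk the bits, MSB first
            bit = (key >> i) & 1
            node = node.setdefault(bit, {})
        node['value'] = value

    def search(self, key):
        node = self.root
        for i in range(KEY_BITS - 1, -1, -1):
            bit = (key >> i) & 1
            if bit not in node:
                return None                        # key absent
            node = node[bit]
        return node.get('value')

if __name__ == "__main__":
    t = BitwiseTrie()
    t.insert(42, "forty-two")
    print(t.search(42), t.search(7))   # forty-two None
```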
No (at least in a model where the elements stored in the data structure can be compared for order only; hashing does not help for worst-case time bounds because there can be one big collision).
Let's suppose that every insertion requires at most c comparisons. (Heck, let's make the weaker assumption that n insertions require at most c*n comparisons.) Consider an adversary that inserts n elements and then looks up one. I'll describe an adversarial strategy that, during the insertion phase, forces the data structure to have Omega(n) elements that, given the comparisons made so far, could be ordered any which way. Then the data structure can be forced to search these elements, which amount to an unsorted list. The result is that the lookup has worst-case running time Omega(n).
The adversary's goal is to give away as little information as possible. Elements are sorted into three groups: winners, losers, and unknown. Initially, all elements are in the unknown group. When the algorithm compares two unknown elements, one chosen arbitrarily becomes a winner and the other becomes a loser. The winner is deemed greater than the loser. Similarly, unknown-loser, unknown-winner, and loser-winner comparisons are resolved by designating one of the elements a winner and the other a loser, without changing existing designations. The remaining cases are loser-loser and winner-winner comparisons, which are handled recursively (so the winners' group has a winner-unknown subgroup, a winner-winners subgroup, and a winner-losers subgroup).

By an averaging argument, since at least n/2 elements are compared at most 2*c times, there exists a subsub...subgroup of size at least n/2 / 3^(2*c) = Omega(n). It can be verified that none of these elements are ordered by previous comparisons.
I wonder whether such a data structure can even be implemented.
I am afraid the answer is no.
Searching OK, Insertion NOT
When we look at data structures like the binary search tree, B-tree, red-black tree and AVL tree, they have an average search complexity of O(log N), but at the same time their average insertion complexity is also O(log N). The reason is obvious: an insertion has to navigate the same path that a search for that element would follow.
Insertion OK, Searching NOT
Data structures like the singly linked list and doubly linked list have an average insertion complexity of O(1), but searching in them is a painful O(N), simply because they don't offer any index-based element access.
The answer to your question lies in the skip list, which is a linked-list-based structure yet still needs O(log N) on average for insertion (even though plain lists are expected to insert in O(1)).
On a closing note, a hash map comes very close to meeting the speedy-search and speedy-insertion requirement, at the cost of a lot of space, but if implemented badly it can degrade to O(N) for both insertion and searching.
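For reference, a compact skip list sketch in Python (randomized levels, no deletion) shows why insertion ends up at expected O(log N) as well: the insert must first locate its predecessor at every level before splicing in the new node.

```python
import random

# A minimal skip list: search and insert both walk down the levels,
# so both cost expected O(log n).

MAX_LEVEL = 16
P = 0.5

class Node:
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * (level + 1)

class SkipList:
    def __init__(self):
        self.head = Node(None, MAX_LEVEL)
        self.level = 0

    def _random_level(self):
        lvl = 0
        while random.random() < P and lvl < MAX_LEVEL:
            lvl += 1
        return lvl

    def search(self, key):
        node = self.head
        for i in range(self.level, -1, -1):          # drop down level by level
            while node.forward[i] and node.forward[i].key < key:
                node = node.forward[i]
        node = node.forward[0]
        return node is not None and node.key == key

    def insert(self, key):
        update = [self.head] * (MAX_LEVEL + 1)
        node = self.head
        for i in range(self.level, -1, -1):          # find predecessor per level
            while node.forward[i] and node.forward[i].key < key:
                node = node.forward[i]
            update[i] = node
        lvl = self._random_level()
        self.level = max(self.level, lvl)
        new = Node(key, lvl)
        for i in range(lvl + 1):                     # splice into each level
            new.forward[i] = update[i].forward[i]
            update[i].forward[i] = new

if __name__ == "__main__":
    sl = SkipList()
    for x in [5, 1, 9, 3]:
        sl.insert(x)
    print(sl.search(3), sl.search(4))   # True False
```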
How would you find the k smallest elements of an unsorted array using quicksort (other than just sorting the array and taking the first k)? Would the worst-case running time still be the same, O(n^2)?
You could optimize quicksort: all you have to do is skip the recursive calls on the side of each partition that doesn't contain position k, and keep going until a pivot lands at position k. At that point the first k positions hold the k smallest elements, so if you don't need your output sorted, you can stop there.
Warning: non-rigorous analysis ahead.
However, I think the worst-case time complexity will still be O(n^2). That occurs when you always pick the biggest or smallest element to be your pivot, and you devolve into bubble sort (i.e. you aren't able to pick a pivot that divides and conquers).
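A sketch of that partial-quicksort idea in Python (with a random pivot, which makes the O(n^2) case unlikely but not impossible): partition as usual, but recurse only into the side containing index k; afterwards the first k slots hold the k smallest elements in no particular order.

```python
import random

# Quickselect-style partial quicksort: expected O(n), worst case O(n^2).

def partition(arr, lo, hi):
    pivot_idx = random.randint(lo, hi)                  # random pivot choice
    arr[pivot_idx], arr[hi] = arr[hi], arr[pivot_idx]
    pivot = arr[hi]
    i = lo
    for j in range(lo, hi):
        if arr[j] < pivot:
            arr[i], arr[j] = arr[j], arr[i]
            i += 1
    arr[i], arr[hi] = arr[hi], arr[i]
    return i

def k_smallest(arr, k):
    lo, hi = 0, len(arr) - 1
    while lo < hi:
        p = partition(arr, lo, hi)
        if p == k:            # everything left of p is smaller than arr[p]
            break
        elif p < k:
            lo = p + 1        # index k lies in the right part
        else:
            hi = p - 1        # index k lies in the left part
    return arr[:k]

if __name__ == "__main__":
    data = [7, 2, 9, 4, 1, 8, 3]
    print(sorted(k_smallest(data, 3)))   # [1, 2, 3]
```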
Another solution (if the only purpose of this collection is to pick out the k minimum elements) is to use a heap bounded to exactly k nodes, i.e. of height ceil(log(k)): keep a max-heap of the k smallest elements seen so far and replace its root whenever a smaller element comes along. Each of the n elements then costs at most O(log(k)) to process, and the same for removal, so the whole pass is O(n*log(k)) (versus O(n*log(n)) for both in a full heapsort). A full heapsort would instead give the whole array back in sorted order in linearithmic time worst-case; same with mergesort.
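A sketch of that bounded-heap approach using Python's heapq (heapq is a min-heap, so values are negated to simulate the max-heap described above):

```python
import heapq

# Keep a max-heap of the k smallest elements seen so far; each of the n
# elements costs at most O(log k), so the whole pass is O(n log k).

def k_smallest_heap(items, k):
    heap = []                                  # holds -x for the k smallest x
    for x in items:
        if len(heap) < k:
            heapq.heappush(heap, -x)
        elif -heap[0] > x:                     # x beats the largest of the current k
            heapq.heapreplace(heap, -x)
    return sorted(-v for v in heap)            # O(k log k) to emit in order

if __name__ == "__main__":
    print(k_smallest_heap([7, 2, 9, 4, 1, 8, 3], 3))   # [1, 2, 3]
```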
I'm studying for a data structures exam and I'm trying to solve this question:
Given an array of n numbers and a number Z, find x, y such that x + y = Z, in O(n) average time.
My suggestion is to move the array's contents into a hash table, and, using open addressing, do the following:
For each number A[i], search for Z - A[i] in the hash table (O(1) on average per operation). In the worst case you'll perform n searches, each O(1) on average; that's O(n) on average.
Is my analysis correct?
Given that you are traversing your whole array a second time, yes: that is O(n) * O(1) (and not O(n) + O(1), as previously stated by me) for the average-time hash lookups, so you are talking about an algorithm of O(n) average complexity.
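A sketch of the idea in Python, using the built-in hash-based set; this one-pass variant checks for the complement before inserting the current element, which avoids pairing an element with itself when Z = 2*A[i]:

```python
# Each hash lookup and insertion is O(1) on average, so the whole scan is
# O(n) on average.

def find_pair_with_sum(arr, z):
    seen = set()
    for a in arr:
        if z - a in seen:           # expected O(1) hash lookup
            return z - a, a
        seen.add(a)                 # expected O(1) hash insertion
    return None

if __name__ == "__main__":
    print(find_pair_with_sum([8, 3, 11, 2, 7], 10))   # (8, 2)
```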
I have a randomly ordered array of 30 elements with only 3 distinct keys (TRUE, FALSE and NULL) that I want to sort using insertion sort. What will the time complexity be? Will it be O(n^2), assuming the worst case, or O(n), assuming the best case, since there are only 3 distinct keys?
n refers to the size of the array, not to the number of possible distinct elements. Thus, the complexity is the same:
Worst-case: O(n^2)
Best-case: O(n)
Average-case: O(n^2)
Having only 3 distinct keys reduces the number of elements you have to check during the "insertion" phase, but only by a constant factor. This will not change the asymptotic run-time.
For example, in the average case, instead of an insert checking on the order of n elements, it will check on the order of n/3 elements. This is better, but not asymptotically.
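To make the constant-factor point concrete, here is a quick sketch (Python) of insertion sort with a comparison counter, run on 30 elements drawn from 3 keys versus 30 distinct keys; both counts grow quadratically with n, only the constant differs.

```python
import random

# Insertion sort instrumented with a comparison counter.

def insertion_sort(arr):
    comparisons = 0
    for i in range(1, len(arr)):
        key = arr[i]
        j = i - 1
        while j >= 0:
            comparisons += 1
            if arr[j] > key:            # shift larger elements to the right
                arr[j + 1] = arr[j]
                j -= 1
            else:
                break                   # found the insertion point
        arr[j + 1] = key
    return comparisons

if __name__ == "__main__":
    three_keys = [random.choice([0, 1, 2]) for _ in range(30)]   # 3 distinct keys
    distinct = random.sample(range(1000), 30)                    # 30 distinct keys
    print(insertion_sort(three_keys), insertion_sort(distinct))
```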