I have been asked to devise an O(n log n) algorithm for inserting n elements into a hash table with n slots that uses linear probing.
Naively, inserting n elements can take up to O(n^2) time, for example if the hash function maps every key to the same slot.
Therefore, I am thinking about preventing collisions before inserting any elements, by predicting them with some kind of data structure.
For example: compute the hash value of each element (which is O(n) total), look for possible collisions, change the hash values of the colliding keys, and then do the insertions.
My question: is it possible to find a data structure that solves my problem in O(n log n) time?
Many thanks.
To start, initialize a vEB tree to contain 0..n-1, representing the open slots of the hash table. To insert an element into the hash table, call the FindNext method of the vEB tree once or twice (twice if the probe wraps around past slot n-1) to determine the next free slot, and then call Delete to remove that slot from the vEB tree. The total running time is O(n log log n).
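To illustrate the idea without a full vEB implementation, here is a sketch (all names are mine, not from the answer) that uses a simple segment tree over the n slots instead of a vEB tree. find_next and delete then cost O(log n), so inserting n elements costs O(n log n), which already meets the question's bound; swapping in a van Emde Boas tree brings it down to O(n log log n).

```python
class FreeSlots:
    """Set of free slots {0..n-1}; supports 'smallest free slot >= i' and deletion,
    both in O(log n), via a segment tree of counts."""
    def __init__(self, n):
        self.n = n
        self.size = 1
        while self.size < n:
            self.size *= 2
        # tree[v] = number of free slots in the range covered by node v
        self.tree = [0] * (2 * self.size)
        for i in range(n):
            self.tree[self.size + i] = 1
        for v in range(self.size - 1, 0, -1):
            self.tree[v] = self.tree[2 * v] + self.tree[2 * v + 1]

    def find_next(self, i):
        """Smallest free slot >= i, or None if there is none."""
        return self._find(1, 0, self.size, i)

    def _find(self, v, lo, hi, i):
        if self.tree[v] == 0 or hi <= i:
            return None
        if hi - lo == 1:
            return lo
        mid = (lo + hi) // 2
        res = self._find(2 * v, lo, mid, i)
        return res if res is not None else self._find(2 * v + 1, mid, hi, i)

    def delete(self, i):
        """Mark slot i as occupied."""
        v = self.size + i
        self.tree[v] = 0
        v //= 2
        while v:
            self.tree[v] = self.tree[2 * v] + self.tree[2 * v + 1]
            v //= 2


def insert_all(keys, hash_fn, n):
    """Place the keys (at most n of them) by linear probing, jumping straight
    to the next free slot instead of scanning occupied slots one by one."""
    table = [None] * n
    free = FreeSlots(n)
    for key in keys:
        h = hash_fn(key) % n
        slot = free.find_next(h)
        if slot is None:            # wrap around past slot n-1
            slot = free.find_next(0)
        table[slot] = key
        free.delete(slot)
    return table
```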
Is there a data structure whose elements can be indexed and whose insertion runtime is O(1)? For example, I could index the data structure like so: a[4], and yet inserting an element at an arbitrary place in the data structure would still run in O(1)? Note that the data structure does not maintain sorted order, just the ability for each sequential element to have an index.
I don't think it's possible, since inserting somewhere other than the beginning or end of the ordered data structure means that all the indices after the insertion point must be updated to reflect that their index has increased by 1, which takes O(n) time in the worst case. If the answer is no, could someone prove it mathematically?
EDIT:
To clarify, I want to maintain the order of insertion of elements, so upon inserting, the item inserted remains sequentially between the two elements it was placed between.
The problem that you are looking to solve is called the list labeling problem.
There are lower bounds on the cost that depend on the relationship between the maximum number of labels you need (n) and the number of possible labels (m).
If n is in O(log m), i.e., if the number of possible labels is exponential in the number of labels you need at any one time, then O(1) cost per operation is achievable... but this is not the usual case.
If n is in O(m), i.e., if they are proportional, then O(log^2 n) per operation is the best you can do, and the algorithm is complicated.
If m >= n^2, i.e., if the number of possible labels is at least polynomially larger than n, then you can do O(log n). Amortized O(log n) is simple, and O(log n) worst case is hard. Both algorithms are described in the paper by Dietz and Sleator, "Two Algorithms for Maintaining Order in a List". The hard one makes use of the O(log^2 n) algorithm mentioned above.
HOWEVER, maybe you don't really need labels. If you just need to be able to compare the order of two items in the collection, then you are solving a slightly different problem called "list order maintenance". This problem can actually be solved in constant time -- O(1) cost per operation and O(1) cost to compare the order of two items -- although again O(1) amortized cost is a lot easier to achieve.
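For flavour, here is a deliberately naive labeling sketch (my own illustration, not the Dietz–Sleator algorithms): each item gets an integer label, a new item takes the midpoint of its neighbours' labels, and when a gap runs out everything is relabeled. The list bookkeeping below is O(n) per operation, so only the labeling idea, not the running time, reflects the real algorithms.

```python
GAP = 1 << 20   # label spacing; assumes the label space is much larger than n

class LabeledList:
    def __init__(self):
        self.items = []   # items in list order (naive bookkeeping)
        self.label = {}   # item -> integer label

    def _relabel(self):
        # Spread all items evenly over the label space again.
        for k, item in enumerate(self.items):
            self.label[item] = (k + 1) * GAP

    def insert_after(self, prev, item):
        """Insert item right after prev (or at the front if prev is None)."""
        pos = 0 if prev is None else self.items.index(prev) + 1
        self.items.insert(pos, item)
        lo = 0 if pos == 0 else self.label[self.items[pos - 1]]
        hi = lo + 2 * GAP if pos == len(self.items) - 1 else self.label[self.items[pos + 1]]
        if hi - lo < 2:
            self._relabel()              # no room left between the neighbours
        else:
            self.label[item] = (lo + hi) // 2

    def comes_before(self, a, b):
        """Order comparison is just a label comparison."""
        return self.label[a] < self.label[b]
```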
When inserting into slot i, append the element that was previously at slot i to the end of the sequence, then write the new element at slot i.
If the sequence's capacity must be grown, that growth is not necessarily O(1).
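A minimal sketch of that idea (my own illustration): the new value overwrites slot i and the displaced element goes to the back, which is amortized O(1) but, per the question's edit, does not keep elements between their original neighbours.

```python
def insert_at(seq, i, value):
    """Put value at index i; the element previously at i moves to the end."""
    if i == len(seq):
        seq.append(value)
    else:
        seq.append(seq[i])   # displaced element goes to the back
        seq[i] = value

seq = ["a", "b", "c"]
insert_at(seq, 1, "x")       # seq is now ["a", "x", "c", "b"]
```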
So I've been given the next question:
Describe a data structure with the following interface:
The structure will contain n elements, where each element holds a key and a value (meaning, each element is (key, value)).
insert ((key, value)): insert an element in O(1) average case and O(log n) worst case.
delete ((key, value)): delete the element that corresponds to the given key in O(1) average case and O(log n) worst case.
find (key): find the element that corresponds to the given key and return its value in O(1) average case and O(log n) worst case.
setAll (m): change the value of each element in the structure to be m in O(1) worst case
So my main thought was to use a hash table to ensure O(1) average-case runtime for insert, delete, and find. The hash table will be implemented by chaining, but instead of a linked list, each bucket is an AVL tree, so in the worst case insert, delete, and find are O(log n).
But I got stuck on setAll. I can't see how to do it in O(1) worst-case runtime. I know you can't literally change all the values, because that requires traversing the elements, so I thought maybe I could use global variables and keep track of the calls to setAll, but I can't see how to implement such a thing.
In addition, there is no limit on the space complexity, which is why I used a hash table containing AVL trees. This is also a clue that our lecturer gave us.
A hash table with AVL trees is a good start.
To implement setAll(m), keep an operation counter, and mark each entry with update_op = operation_count during insert.
When setAll(m) is called, set a field last_reset = operation_count and reset_value = m.
Then modify find so it returns reset_value for any entry with update_op < last_reset.
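A minimal sketch of that bookkeeping (a plain dict stands in for the hash table of AVL trees, since the setAll trick is independent of how collisions are resolved; the field names follow the answer above, the rest is mine):

```python
class SetAllMap:
    def __init__(self):
        self.table = {}          # key -> (value, update_op)
        self.op_count = 0
        self.last_reset = -1     # operation count at the most recent setAll
        self.reset_value = None

    def insert(self, key, value):
        self.op_count += 1
        self.table[key] = (value, self.op_count)

    def delete(self, key):
        self.op_count += 1
        self.table.pop(key, None)

    def find(self, key):
        if key not in self.table:
            return None
        value, update_op = self.table[key]
        # An entry written before the last setAll logically holds the reset
        # value, even though we never touched the entry itself.
        return self.reset_value if update_op < self.last_reset else value

    def set_all(self, m):
        self.op_count += 1
        self.last_reset = self.op_count
        self.reset_value = m
```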
Big-O: if adding an element to a HashSet is O(1), how does the HashSet determine whether the element being added is unique or not?
This depends on the choice of hash table, but in most traditional implementations of hash tables (linear probing, chained hashing, quadratic probing, etc.) the cost of doing an insertion is expected O(1) rather than worst-case O(1). The O(1) term here comes from the fact that on average only a constant number of elements in the hash table need to be looked at in the course of inserting a new value. In linear probing, this happens because the expected length of the run of slots to check is O(1), and in chained hashing it's because there are only O(1) elements in expectation in the bucket in question.

The hash table can then check whether the newly inserted element is equal to any of these values while performing the insertion. If so, we know it's a duplicate. If not, then the new element can't be a duplicate: we've checked every element that could possibly be equal to it. That means that on average the cost of an insertion is O(1), and we can check for uniqueness at the time of insertion.
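As a sketch of where that duplicate check happens (a toy chained table in Python, not Java's HashSet): the insertion already walks the one bucket the element can land in, so the uniqueness check is done during that same walk.

```python
class ToyHashSet:
    def __init__(self, num_buckets=16):
        self.buckets = [[] for _ in range(num_buckets)]

    def add(self, x):
        """Return True if x was inserted, False if it was already present."""
        bucket = self.buckets[hash(x) % len(self.buckets)]
        for y in bucket:          # expected O(1) elements in this bucket
            if y == x:
                return False      # duplicate found during the insertion walk
        bucket.append(x)
        return True
```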
I believe inserting into a hash table is average-case O(1) and worst-case O(n). So if we loop through a string and add each word to a hash table (which maps the word to the number of times it occurs in the string), wouldn't that be worst-case O(n^2) run-time? I tried to ask this before, but the answers said it was worst-case O(n). Thanks!
You are right that under reasonable assumptions, a hash table inserts elements in O(1) average time and O(n) worst-case time.
As for your problem, assuming you have n words in a string, you would iterate over each word and enter it into the hash table, which takes O(n) average time or O(n^2) worst-case time.
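For concreteness, the word-count loop under discussion looks something like this; the expected-O(1) claim applies to each individual dictionary update, so the whole pass is expected O(n):

```python
from collections import defaultdict

def word_counts(text):
    counts = defaultdict(int)
    for word in text.split():
        counts[word] += 1     # expected O(1) hash-table update per word
    return counts
```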
The worst case of insert depends on how the implementation handles collisions and which resolution technique it uses. This has a large influence on both put() and get() operations. Collision resolution is implemented differently in each library, but the core idea is to keep all colliding keys in the same bucket and, during retrieval, traverse the colliding keys and apply an equality check to find the requested key. An important detail is that both the keys and the values must be kept in the bucket to make that equality check possible.
Another thing to consider is that during an insertion a hash code is computed for the given key. We can treat this as constant, O(1), for every key.
In the worst case, all the keys could fall into the same bucket, hence O(n) for a single get(). A put() can still be constant O(1) regardless of collisions, provided the implementation simply adds the new entry to the bucket without searching it first.
Maintaining the list of colliding keys is the key factor. Some implementations use a BST rather than a linked list for the bucket, in which case the worst case for insertion and retrieval is O(log n).
In that case, inserting n elements costs O(n log n) rather than O(n^2).
Any decent implementation has to ensure that the hash codes it generates collide as little as possible in order to get good performance.
I have a database of users with their usernames and ids. These are the operations the program will process:
insert, delete (by username), search (by username), print (prints all users' info, sorted by their id).
The time complexity of the first 3 operations shouldn't be more than O(log n), and for print it should be O(n). The solution should be implemented with a balanced BST.
My idea to solve the problem is to have 2 BSTs; the key of one is the id and of the other the username. So we can access an element by its username or id, both in O(log n) time. But this doubles the memory usage and the time of the operations.
Is there a way to access elements both by their username and id in O(log n) time in a better way than what I explained?
What you propose will indeed double the memory and time requirements of your data structure. (Only insertions and deletions take double time; the other operations take no extra time.) However, recall that O(2 log n) is generally treated the same as O(log n) and is much smaller than O(n). If you graph 2 log n and n, they are equal only when n is 2 or 4, and beyond that log n is essentially a flat line compared to n.
I propose that you cannot do better than this using balanced BSTs (or at all, for that matter). Since you need to search based on username in O(log n) time, username must be the key for the tree. However, you also need to retrieve the users sorted by id in O(n) time. That essentially forbids you from sorting them after retrieving them, because you won't be able to sort them faster than O(n log n). Thus, they must already be sorted by id. Therefore, id must be a key for the tree. Hence, you need two trees.
While 2 trees are fine, you can also use a hash table for lookup and delete plus a sorted index for printing. A red-black tree will be fine for the sorted index.
However, if IDs are consecutive non-negative integers, it will be even more efficient to maintain a simple array, where position i contains the object with ID i. Now you can print by just traversing the array. And the hash table's values can be IDs, since these "point" to the corresponding object in the array.
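Assuming, as in the last paragraph, that IDs are small consecutive non-negative integers, a sketch of the array-plus-hash-table layout might look like this (class and method names are mine):

```python
class UserDB:
    def __init__(self, max_id):
        self.by_id = [None] * max_id   # position i holds the username of the user with id i
        self.id_of = {}                # username -> id; "points" into the array

    def insert(self, username, user_id):
        self.by_id[user_id] = username
        self.id_of[username] = user_id

    def delete(self, username):
        user_id = self.id_of.pop(username)
        self.by_id[user_id] = None

    def search(self, username):
        """Return the user's id, or None if the username does not exist."""
        return self.id_of.get(username)

    def print_all(self):
        """Print all users sorted by id with one pass over the array: O(n) if ids are dense."""
        for user_id, username in enumerate(self.by_id):
            if username is not None:
                print(user_id, username)
```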