Hash tables Ω(n^2) runtime? - big-o

I am really confused about this. Having read the textbook and done exercises I still don't get how it works, and unfortunately I can't go in person to see the professor and it's somewhat difficult to get in touch (summer online course, different time zones). I feel like it would 'click' if I just understood how to do this problem. The textbook details hash functions and runtime individually but I feel like this question is outside the scope of what we've learned. If someone could point me at anything that might help, that would be great.
1) Consider the process of inserting m keys into a hash table T[0..m − 1], where m is a prime, and we use open addressing. The hash function we use is h(k, i) = (k + i) mod m. Give an example of m keys k1, k2 ... km, such that the following sequence of operations takes Ω(n^2) time:
insert(k1), insert(k2), ..., insert(km)
I understand that insert operations are supposed to take O(1) time or, in some cases, O(n). How exactly am I supposed to come up with keys that will turn that into Ω(n^2) time? I'm hoping to understand this, and I feel like I'm missing some huge hint, because the textbook chapter seems simple, makes sense to me, and doesn't help with this at all. The question states that m is a prime; is this important? I'm just so lost, and Google for once fails me.

The keyword here is hash collision:
In order for a hash function to work well, you need the hash values to be well distributed over all m slots the entries are stored in. If the hash table has about as many slots as elements inserted, you can expect every element to be stored at (or near) its hash value (meaning only small amounts of probing are necessary), making access, insertion, and deletion constant-time operations.
If, however, you find different input values for which the hash function maps to the same slot every time (collisions), then during insertion the probing step has to skip over all previously added elements, taking Ω(n) time per element on average. Thus we get a total runtime of Ω(n²).
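Concretely, with the given probe sequence h(k, i) = (k + i) mod m, you can force this by choosing keys that are all congruent modulo m, e.g. k_j = (j − 1)·m for j = 1..m: every key hashes to slot 0, so the j-th insertion probes past the j − 1 keys already placed, and the total work is 1 + 2 + … + m = Θ(m²), i.e. Ω(n²) since n = m keys are inserted. A minimal Python sketch (the helper names are mine) that counts the probes:

def insert_with_linear_probe(table, k):
    # Open addressing with h(k, i) = (k + i) mod m; returns the number of probes used.
    m = len(table)
    for i in range(m):
        slot = (k + i) % m
        if table[slot] is None:
            table[slot] = k
            return i + 1          # number of slots examined
    raise RuntimeError("table full")

m = 101                            # any prime table size
keys = [j * m for j in range(m)]   # all keys are congruent to 0 (mod m)
table = [None] * m

total_probes = sum(insert_with_linear_probe(table, k) for k in keys)
print(total_probes, m * (m + 1) // 2)   # both equal: total probes grow as Theta(m^2)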

Related

Data Structures & Algorithms Optimal Solution Explanation

I'm currently doing a DS&A Udemy course as I'm prepping for the heavy recruiting this upcoming fall. I stumbled upon a problem that was prompted along the lines of:
"Given two arrays, figure out which integer that is present in the first array is missing from the second."
There were two solutions given in the course: one considered a brute-force solution, the other the more optimal one.
Here are the solutions:
import collections

def finderBasic(list1, list2):
    # Sort both lists, then walk them in lockstep until they differ.
    list1.sort()
    list2.sort()
    for i in range(len(list2)):
        if list1[i] != list2[i]:
            return list1[i]
    return list1[-1]  # no mismatch found: the missing element is the largest one

def finderOptimal(list1, list2):
    # Count each element of list2, then find the element of list1 that isn't covered.
    d = collections.defaultdict(int)
    for num in list2:
        d[num] += 1
    for num in list1:
        if d[num] == 0:
            return num
        else:
            d[num] -= 1
The course explains that finderOptimal is a more optimal way of solving the problem, as it solves it in O(n), or linearly. Can someone please further explain to me why that is? I just felt like finderBasic was much simpler and only went through one loop. Any help would be much appreciated, thank you!
You would be correct if it were only about going through loops; then the first solution would look better.
As you said, going through one whole for loop takes O(n) time, and it doesn't matter whether you go through it once, twice, or c times (as long as c is a small constant).
However, the heavy operation here is sorting, which takes roughly n*log(n) time, asymptotically more than O(n). That means that even if you run through a for loop twice in the second solution, it will still be much better than sorting once.
Please note that accessing a dictionary key takes approximately O(1) time, so the total is still O(n) with the loop.
Refer to: https://wiki.python.org/moin/TimeComplexity
The basic solution may be nicer for a reader, as it's very simple and straightforward; however, its time complexity is worse.
Disclaimer: I am not familiar with python.
There are two loops you are not accounting for in the first example: each of those sort() calls hides its own looping to implement the sorting. On top of that, the best performance you can usually get for general-purpose (comparison) sorting is O(n log n).
The second case avoids sorting entirely and simply uses a tally to mark what is present. It does so with a dictionary, which is a hash table, and I am sure you have already learned that hash tables offer constant-time, O(1), operations.
Simpler does not always mean more efficient; conversely, efficient code is often harder to comprehend.
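To see the trade-off empirically, here is a rough, hedged timing sketch (it assumes the two functions from the question above plus a made-up input generator; absolute numbers will vary by machine, and the asymptotic gap only becomes pronounced for large n):

import random
import time

n = 200_000
list1 = list(range(n))
missing = random.choice(list1)
list2 = [x for x in list1 if x != missing]
random.shuffle(list1)
random.shuffle(list2)

for finder in (finderBasic, finderOptimal):
    start = time.perf_counter()
    result = finder(list(list1), list(list2))   # pass copies, since finderBasic sorts in place
    elapsed = time.perf_counter() - start
    print(finder.__name__, result, f"{elapsed:.4f}s")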

Simple ordering for a linked list

I want to create a doubly linked list with an order sequence (an integer attribute) such that sorting by the order sequence could create an array that would effectively be equivalent to the linked list.
given: a <-> b <-> c
a.index > b.index
b.index > c.index
This index would need to efficiently handle an arbitrary number of inserts.
Is there a known algorithm for accomplishing this?
The problem is when the list gets large and the index sequence has become packed. In that situation the list has to be scanned to put slack back in.
I'm just not sure how this should be accomplished. Ideally there would be some sort of automatic balancing so that this borrowing is both fast and rare.
The naive solution of shifting all the indices to the left or right by 1 to make room for the insert is O(n).
I'd prefer to use integers; with floating point, repeatedly taking midpoints runs out of precision fairly quickly in most implementations.
This is one of my favorite problems. In the literature, it's called "online list labeling", or just "list labeling". There's a bit on it in wikipedia here: https://en.wikipedia.org/wiki/Order-maintenance_problem#List-labeling
Probably the simplest algorithm that will be practical for your purposes is the first one in here: https://www.cs.cmu.edu/~sleator/papers/maintaining-order.pdf.
It handles insertions in amortized O(log N) time, and to manage N items, you have to use integers that are big enough to hold N^2. 64-bit integers are sufficient in almost all practical cases.
What I wound up going for was a roll-my-own solution, because it looked like the algorithm wanted to have the entire list in memory before it would insert the next node. And that is no good.
My idea was to borrow some of the ideas from that algorithm. What I did was make IDs ints and sort orders longs. The algorithm is lazy, stuffing entries anywhere they'll fit. Once it runs out of space in some little clump somewhere, it begins a scan up and down from the clump and tries to establish an even spacing, such that if there are n items scanned they share n^2 padding between them.
In theory this will mean that over time the list will be perfectly padded, and given that my IDs are ints and my sort orders are longs, there will never be a scenario where you will not be able to achieve n^2 padding. I can't speak to the upper bound on the number of operations, but my gut tells me that by doing polynomial work at 1/polynomial frequency, I'll be doing just fine.
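Here is a minimal Python sketch of that lazy relabeling idea, under my own assumptions (the node class, the spacing constant, and the window-widening rule are illustrative, not from the paper; Python ints are unbounded, so the int/long distinction is ignored): each new node takes the midpoint label between its neighbours, and when a gap is exhausted, an ever-larger window around the crowded spot is re-spaced evenly.

SPACING = 1 << 20                       # assumed initial gap between labels

class Node:
    def __init__(self, value, label):
        self.value = value
        self.label = label              # the integer order sequence
        self.prev = None
        self.next = None

def insert_after(node, value):
    # Insert a new node right after `node`, relabeling locally if the gap is used up.
    right = node.next
    if right is not None and right.label - node.label < 2:
        relabel_around(node)
        right = node.next
    label = node.label + SPACING if right is None else (node.label + right.label) // 2
    new = Node(value, label)
    new.prev, new.next = node, right
    node.next = new
    if right is not None:
        right.prev = new
    return new

def relabel_around(node, radius=64):
    # Re-space an ever-larger window around `node` until there is room again.
    while True:
        left = node
        for _ in range(radius):
            if left.prev is None:
                break
            left = left.prev
        window, cur = [], left
        for _ in range(2 * radius + 1):
            if cur is None:
                break
            window.append(cur)
            cur = cur.next
        lo = left.prev.label if left.prev else 0
        hi = cur.label if cur else lo + (len(window) + 1) * SPACING
        step = (hi - lo) // (len(window) + 1)
        if step >= 2:                   # enough slack: spread the labels evenly
            for i, n in enumerate(window):
                n.label = lo + (i + 1) * step
            return
        radius *= 2                     # too crowded: widen the window and retry

# Usage: head = Node("a", SPACING); b = insert_after(head, "b"); c = insert_after(b, "c")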

Data Structure for tuple indexing

I need a data structure that stores tuples and lets me do a query like: given a tuple (x,y,z) of integers, find the next one (an upper bound for it). By that I mean considering the natural ordering (a,b,c) <= (d,e,f) <=> a<=d and b<=e and c<=f. I have tried MSD radix sort, which splits items into buckets and sorts them (and does this recursively for all positions in the tuples). Does anybody have any other suggestion? Ideally I would like the above query to be answered within O(log n), where n is the number of tuples.
Two options.
Use binary search on a sorted array. If you build the keys (assuming 32-bit ints) as (a<<64)|(b<<32)|c and hold them in a simple array, packed one beside the other, you can use binary search to locate the value you are searching for (if using C, there is even a library function to do this), and the next one is simply one position along. Worst-case performance is O(log N), and if you can do http://en.wikipedia.org/wiki/Interpolation_search then you might even approach O(log log N).
The problem with packed binary keys is that it might be tricky to add new values, and you might need gyrations if you will exceed available memory. But it is fast, needing only a few random memory accesses on average.
Alternatively, you could build a hash table by generating a key from a|b|c in some form, and then have the hash data point to a structure that contains the next value, whatever that might be. Possibly a little harder to create in the first place, as when generating the table you need to know the next value already.
Problems with the hash approach are that it will likely use more memory than the binary search method, and that performance is great if you don't get hash collisions but then starts to drop off, although there are variations of the algorithm that help in some cases. The hash approach is possibly much easier for inserting new values.
I also see you had a similar question along these lines, so I guess the guts of what I am saying is: combine a, b, c to produce a single long key, and use that with binary search, a hash, or even a B-tree. If the length of the key is your problem (which language are you using?), could you treat it as a string?
If this answer is completely off base, let me know and I will see if I can delete it, so your question remains unanswered rather than having a useless answer.
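For what it's worth, here is a small Python sketch of the packed-key-plus-binary-search idea (assuming non-negative components that fit in 32 bits; the helper names are mine). Note that packing realizes lexicographic order on the tuples, which is what the binary-search answer implies:

import bisect

def pack(t):
    a, b, c = t
    return (a << 64) | (b << 32) | c       # assumes 0 <= a, b, c < 2**32

def unpack(key):
    return (key >> 64, (key >> 32) & 0xFFFFFFFF, key & 0xFFFFFFFF)

tuples = [(1, 2, 3), (1, 2, 5), (4, 0, 0), (2, 7, 1)]
keys = sorted(pack(t) for t in tuples)

def next_tuple(query):
    # Smallest stored tuple strictly greater than `query` in packed (lexicographic) order.
    i = bisect.bisect_right(keys, pack(query))
    return unpack(keys[i]) if i < len(keys) else None

print(next_tuple((1, 2, 3)))   # -> (1, 2, 5)
print(next_tuple((3, 0, 0)))   # -> (4, 0, 0)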

Understanding the Count Sketch data structure and associated algorithms

Working on wrapping my head around the CountSketch data structure and its associated algorithms. It seems to be a great tool for finding common elements in streaming data, and the additive nature of it makes for some fun properties with finding large changes in frequency, perhaps similar to what Twitter uses for trending topics.
The paper is a little difficult to understand for someone who has been away from more academic approaches for a while, and while a previous post here did help some, for me at least it still left quite a few questions.
As I understand it, the Count Sketch structure is similar to a Bloom filter. However, the selection of hash functions has me confused. The structure is an N-by-M table: N hash functions, each with M possible values, determine the "bucket" to alter, and for each of the N rows there is another hash function s that is "pairwise independent".
Are the hashes to be selected from a universal hashing family, say something of the form h(x) = ((ax+b) % some_prime) % M?
And if so, where are the s hashes that return either +1 or -1 chosen from? And what is the reason for ever subtracting from one of the buckets?
They subtract from the buckets so that the average effect of the additions/subtractions caused by other keys is 0. If half the time I add the count of 'foo' to a bucket and half the time I subtract it, then in expectation the occurrences of 'foo' do not influence the estimate of the count for 'bar'.
Picking a universal hash function like you describe will indeed work, but it's mostly important for the theory rather than the practice. Salting your favorite reasonable hash function will work too, you just can't meaningfully write proofs based on the expected values using a few fixed hash functions.
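As a hedged illustration of the structure being discussed (the parameters and the salted-hash scheme below are my own simplifications, not the paper's construction): a table of depth N and width M, where each row has a bucket hash and a ±1 sign hash; an update adds sign·count into one bucket per row, and the estimate is the median over rows of sign·bucket.

import random
import statistics

class CountSketch:
    def __init__(self, depth=5, width=1000, seed=0):
        rng = random.Random(seed)
        self.width = width
        # One (bucket, sign) salt pair per row; salting Python's built-in hash
        # stands in for a pairwise-independent family here.
        self.salts = [(rng.getrandbits(64), rng.getrandbits(64)) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _bucket(self, salt, item):
        return hash((salt, item)) % self.width

    def _sign(self, salt, item):
        return 1 if hash((salt, item)) & 1 else -1

    def update(self, item, count=1):
        for row, (bs, ss) in zip(self.table, self.salts):
            row[self._bucket(bs, item)] += self._sign(ss, item) * count

    def estimate(self, item):
        return statistics.median(
            self._sign(ss, item) * row[self._bucket(bs, item)]
            for row, (bs, ss) in zip(self.table, self.salts)
        )

cs = CountSketch()
for i in range(10_000):
    cs.update("foo")
    cs.update(f"rare-{i}")
print(cs.estimate("foo"))        # close to 10000, up to sketching error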

how to create a collection with O(1) complexity

I would like to create a data structure or collection which will have O(1) complexity in adding, removing and calculating no. of elements. How am I supposed to start?
I have thought of a solution: I will use a Hashtable and for each key / value pair inserted, I will have only one hash code, that is: my hash code algorithm will generate a unique hash value every time, so the index at which the value is stored will be unique (i.e. no collisions).
Will that give me O(1) complexity?
Yes, that will work, but as you mentioned, your hashing function needs to be 100% collision-free. Any duplicates will result in you having to use some sort of collision resolution; I would recommend separate chaining.
edit: Hashmap.size() allows for O(1) access
edit 2: Response to the confusion Larry has caused =P
Yes, hashing is O(k) where k is the key length. Everyone can agree on that. However, if you do not have a perfect hash, you simply cannot get O(1) worst-case time. Your claim was that you do not need uniqueness to achieve O(1) deletion of a specific element. I guarantee you that is wrong.
Consider a worst-case scenario: every element hashes to the same bucket. You end up with a single linked list, which, as everyone knows, does not have O(1) deletion of a specific element (you have to find it first). I would hope, as you mentioned, nobody is dumb enough to write a hash like this.
The point is, uniqueness of the hash is a prerequisite for O(1) worst-case runtime.
Even then, though, it is technically not O(1) for every single operation; only with amortized analysis do you achieve constant-time efficiency. As noted in Wikipedia's article on amortized analysis:
The basic idea is that a worst case operation can alter the state in such a way that the worst case cannot occur again for a long time, thus "amortizing" its cost.
That is referring to the idea that resizing your hashtable (altering the state of your data structure) at certain load factors can ensure a smaller chance of collisions etc.
I hope this clears everything up.
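A small sketch of that worst case (the class is illustrative): a chained table whose hash function sends every key to the same bucket degenerates into a linear scan per operation, so n inserts cost Θ(n²) overall.

class BadHashTable:
    # Chained hash table whose hash function maps every key to bucket 0.
    def __init__(self, buckets=16):
        self.buckets = [[] for _ in range(buckets)]
        self.count = 0

    def _bucket(self, key):
        return 0                      # pathological: every key collides

    def add(self, key):
        b = self.buckets[self._bucket(key)]
        if key not in b:              # O(chain length) scan
            b.append(key)
            self.count += 1

    def remove(self, key):
        b = self.buckets[self._bucket(key)]
        b.remove(key)                 # O(chain length) scan again
        self.count -= 1

    def __len__(self):
        return self.count             # size tracked separately: O(1)

t = BadHashTable()
for i in range(1000):
    t.add(i)          # the i-th insert scans a chain of length i: Theta(n^2) overall
print(len(t))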
Adding, removing, and size (provided the size is tracked separately, using a simple counter) can all be provided by a linked list, unless you need to remove a specific item. You should be more specific about your requirements.
Doing a totally non-clashing hash function is quite tricky even when you know exactly the space of things being hashed, and it's impossible in general. It also depends deeply on the size of the array that you're hashing into. That is, you need to know exactly what you're doing to make that work.
But if you instead relax that a bit so that identical hash codes don't imply equality¹, then you can use the existing Java HashMap framework for all the other parts. All you need to do is to plug in your own hashCode() implementation in your key class, which is something that Java has always supported. And make sure that you've got equality defined right too. At that point, you've got the various operations being not much more expensive than O(1), especially if you've got a good initial estimate for the capacity and load factor.
¹ Equality must imply equal hash codes, of course.
Even if your hash codes are unique, this doesn't guarantee a collision-free collection. This is because your hash map is not of unlimited size: the hash code has to be reduced to the number of buckets in your hash map, and after this reduction you can still get collisions.
e.g. say I have three objects: A (hash: 2), B (hash: 18), C (hash: 66), all unique.
Say you put them in a HashMap with a capacity of 16 (the default). If they were mapped to a bucket with % 16 (it's actually a bit more complex than this), then after reducing the hash codes we have A (2 % 16 = 2), B (18 % 16 = 2), C (66 % 16 = 2): all three land in the same bucket.
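The same point in a tiny Python sketch (the bucket count and hash codes are taken from the example above; plain % stands in for Java's actual bit-mixing):

hash_codes = {"A": 2, "B": 18, "C": 66}   # unique hash codes
buckets = 16                               # table capacity

for name, h in hash_codes.items():
    print(name, h, "-> bucket", h % buckets)   # all three print bucket 2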
HashMap is likely to be faster than Hashtable, unless you need thread safety (in which case I suggest you use ConcurrentHashMap).
IMHO, Hashtable has been a legacy collection for 12 years, and I would suggest you only use it if you have to.
What functionality do you need that a linked list won't give you?
Surprisingly, your idea will work, if you know all the keys you want to put in the collection in advance. The idea is to generate a special (perfect) hash function which maps each key to a unique value in the range [0, n). Then our "hash table" is just a simple array (plus an integer to cache the number of elements).
Implementing this is not trivial, but it's not rocket science either. I'll leave it to Steve Hanov to explain the ins-and-outs, as he gives a much better explanation than I ever could.
It's simple: just use a hash map. You don't need to do anything special. A hash map itself is O(1) on average for insertion, deletion, and counting elements.
Even if the hash codes are not unique, the operations will still be O(1) on average, as long as the hash map is automatically expanded when the collection gets too large (most implementations will do this for you automatically).
So just use the hash map according to its documentation, and all will be well. Don't think up anything more complicated; it will just be a waste of time.
Avoiding collisions entirely is really impossible with a hash; if it were possible, it would basically just be an array, or a mapping to an array, not a hash. But it isn't necessary to avoid collisions: the expected time is still O(1) with collisions.
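For completeness, a minimal sketch showing that a built-in hash-based collection already gives everything the question asks for, in expected O(1) per operation (here a Python set; a dict works the same way for key/value pairs):

items = set()

items.add("apple")      # expected O(1) insert
items.add("banana")
items.discard("apple")  # expected O(1) removal (no error if absent)
print(len(items))       # O(1) size: the count is tracked internally -> 1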

Resources