I'm taking a data-structure class, and the lecturer made the following assertion:
the number of attempts needed to insert n keys in a hash table with linear probing is independent of their order.
No proof was given, so I tried to get one myself. However, I'm stuck.
My approach at the moment: I try to show that if I swap two adjacent keys in the insertion order, the total number of attempts doesn't change. I get the idea behind it, and I think it's going in the right direction, but I can't manage to turn it into a rigorous proof.
As an aside, does this fact also hold for other probing techniques such as quadratic probing or double hashing?
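Not a proof, but here's a quick empirical check (table size, hash function, and keys chosen arbitrarily) that the claim at least holds on small examples: every insertion order of the same key set costs the same total number of probes.

```python
# Empirical check (not a proof): count the total number of cells examined when
# inserting the same keys under linear probing, over every insertion order.
from itertools import permutations

def total_probes(keys, table_size):
    """Insert keys with linear probing (h(k) = k % table_size, step 1)
    and return the total number of cells examined."""
    table = [None] * table_size
    probes = 0
    for k in keys:
        i = k % table_size
        while True:
            probes += 1
            if table[i] is None:
                table[i] = k
                break
            i = (i + 1) % table_size
    return probes

keys = [3, 14, 15, 9, 26, 5]                                # arbitrary example keys
print({total_probes(p, 11) for p in permutations(keys)})    # a single value
```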
I want to create a doubly linked list with an order sequence (an integer attribute) such that sorting the nodes by this attribute would produce an array that is effectively equivalent to the linked list.
given: a <-> b <-> c
a.index > b.index
b.index > c.index
This index would need to handle an arbitrary number of inserts efficiently.
Is there a known algorithm for accomplishing this?
The problem is when the list gets large and the index sequence has become packed. In that situation the list has to be scanned to put slack back in.
I'm just not sure how this should be accomplished. Ideally there would be some sort of automatic balancing so that this re-spacing is both fast and rare.
The naive solution of shifting all the left or right indices by 1 to make room for the insert is O(n).
I'd prefer to use integers; with floating point, repeatedly splitting the gap between two labels runs out of precision fairly quickly in most implementations.
This is one of my favorite problems. In the literature, it's called "online list labeling", or just "list labeling". There's a bit on it on Wikipedia here: https://en.wikipedia.org/wiki/Order-maintenance_problem#List-labeling
Probably the simplest algorithm that will be practical for your purposes is the first one in here: https://www.cs.cmu.edu/~sleator/papers/maintaining-order.pdf.
It handles insertions in amortized O(log N) time, and to manage N items, you have to use integers that are big enough to hold N^2. 64-bit integers are sufficient in almost all practical cases.
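For a feel of the mechanics, here is a minimal sketch of the labeling idea (not the paper's exact algorithm, and the constants are arbitrary): give each new item the midpoint of its neighbours' labels, and renumber everything when a gap closes.

```python
# Minimal sketch of integer list labeling (not the paper's exact algorithm):
# items is a list of (value, label) pairs kept in list order.  A new item gets
# the midpoint label between its neighbours; if the gap is exhausted, we
# renumber everything with even spacing.
LABEL_SPAN = 1 << 32          # room for ~N^2 labels with N up to about 65,000

def insert_after(items, pos, value):
    """Insert value after index pos (pos == -1 means insert at the front)."""
    left = items[pos][1] if pos >= 0 else 0
    right = items[pos + 1][1] if pos + 1 < len(items) else LABEL_SPAN
    if right - left < 2:                     # no gap left: renumber everything
        step = LABEL_SPAN // (len(items) + 2)
        for i, (v, _) in enumerate(items):
            items[i] = (v, (i + 1) * step)
        left = items[pos][1] if pos >= 0 else 0
        right = items[pos + 1][1] if pos + 1 < len(items) else LABEL_SPAN
    items.insert(pos + 1, (value, (left + right) // 2))

items = []
insert_after(items, -1, "a")
insert_after(items, 0, "c")
insert_after(items, 0, "b")                  # lands between "a" and "c"
print(items)                                 # labels increase with position
```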
What I wound up going for was a roll-my-own solution, because it looked like the algorithm wanted to have the entire list in memory before it would insert the next node, and that is no good for my case.
My idea borrows some of the ideas from that algorithm. I made IDs ints and sort orders longs. The algorithm is lazy, stuffing entries anywhere they'll fit. Once it runs out of space in some little clump somewhere, it scans up and down from the clump and tries to establish an even spacing, such that if n items were scanned they share n^2 padding between them.
In theory this means the list becomes perfectly padded over time, and given that my IDs are ints and my sort orders are longs, there will never be a scenario where you cannot achieve n^2 padding. I can't speak to the upper bound on the number of operations, but my gut tells me that by doing polynomial work at 1/polynomial frequency, I'll be doing just fine.
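Roughly, the re-spacing step looks like this (just a sketch; the exact window-growing rule doesn't matter much):

```python
# Sketch of the re-spacing pass: labels is the sorted list of long sort orders
# and [lo, hi) is the crowded window around a failed insert.  Widen the window
# until its n entries have at least n^2 labels of room between the fixed
# neighbours, then spread the entries evenly across that room.
def respread(labels, lo, hi, total_span):
    while True:
        n = hi - lo
        left_edge = labels[lo - 1] if lo > 0 else -1
        right_edge = labels[hi] if hi < len(labels) else total_span
        room = right_edge - left_edge - 1
        if room >= max(n * n, n + 1):
            break                            # enough padding for this window
        if lo == 0 and hi == len(labels):
            break                            # whole list reached; with long
                                             # sort orders there is still room
        lo, hi = max(lo - 1, 0), min(hi + 1, len(labels))
    step = room // (n + 1)
    for i in range(n):                       # spread the window evenly
        labels[lo + i] = left_edge + (i + 1) * step

labels = [10, 11, 12, 13, 1_000_000]
respread(labels, 1, 3, 2**63)                # re-space around the crowded spot
print(labels)
```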
Quick question about hash tables.
I'm currently implementing a hash table using a combination of separate chaining and open addressing, limiting each bucket's linked list to a certain maximum length. However, I'm having trouble thinking of a way to efficiently get/remove with this hash table structure, and I'm wondering if I'm being blindingly stupid or if anyone has approached a similar issue before.
If I just keep probing with the collision resolution scheme, I could potentially go on forever and never learn that the key is not in the table. This is because most probing methods will not cover every bucket, and I'd rather not use linear probing.
It is also expensive to keep track of which buckets have already been examined. And since a removal can empty a bucket while a later bucket on the same probe path stays occupied, the search cannot simply stop the moment it encounters an empty bucket.
I'd greatly appreciate any ideas on the issue.
Thanks!
In a scenario with unlimited collisions, we usually tend to use one of:
linear probing: jump n cells each time, where n is a prime number >= 7. Why prime? When the stride is a prime that does not divide the table size (and most hash tables use a prime table size anyway), the probe sequence visits every cell in the table instead of just bouncing around a subset of cells (see the quick check after this list).
poly probing: jump n cells each time, where n is recomputed using a polynomial function such as f(x) = x^2 + 2x + 1. Why? This gives a different offset on each probe and isn't based entirely on the values in the cell.
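A quick way to convince yourself of the coverage point: what matters is that the stride shares no common factor with the table size, and then the probe sequence visits every cell exactly once before repeating. A small check (table size and stride chosen arbitrarily):

```python
# Check that a stride coprime to the table size visits every cell exactly once.
from math import gcd

def probe_sequence(start, step, table_size):
    seq, i = [], start
    for _ in range(table_size):
        seq.append(i)
        i = (i + step) % table_size
    return seq

table_size, step = 16, 7                     # gcd(7, 16) == 1
assert gcd(step, table_size) == 1
assert len(set(probe_sequence(3, step, table_size))) == table_size
print(sorted(probe_sequence(3, step, table_size)))   # all cells 0..15
```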
I was asked an interview question: return the number with the most repetitions in an array. For example, {1,1,2,3,4} returns 1.
I first proposed a hashtable-based method, which requires O(n) space.
Then I said we could sort the array first and then go through it to find the number, which requires O(N log N) time.
The interviewer was still not satisfied.
Any optimization?
Thanks.
Interviewers aren't always looking for solutions per se. Sometimes they're looking to find out your capacity to do other things. You should have asked if there were any constraints on the data, such as:
is it already sorted?
are the values limited to a certain range?
This establishes your ability to think about problems rather than just blindly vomit forth stuff that you've read in textbooks. For example, if it's already sorted, you can do it in O(n) time, O(1) space simply by looking for the run with the largest size.
If it's unsorted but limited to the values 1..100, you can still do it in O(n) time, O(1) space by creating a count of each possible value, initially all set to zero, then incrementing for each item.
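For instance, a minimal sketch of the bounded-range case (assuming, as above, values limited to 1..100; on a tie it returns the smallest such value):

```python
# Most frequent value when values are known to lie in a small range [lo, hi]:
# O(n) time, O(1) extra space (the count array has fixed size).
def most_frequent_in_range(values, lo=1, hi=100):
    counts = [0] * (hi - lo + 1)
    for v in values:
        counts[v - lo] += 1
    return max(range(lo, hi + 1), key=lambda v: counts[v - lo])

print(most_frequent_in_range([1, 1, 2, 3, 4]))   # -> 1
```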
Then ask the interviewer other things like:
What sort of behaviour do they want if there are two numbers with the same count?
If they're not satisfied with your provided solutions, try to get a clue as to what they're thinking. Do they think it can be done in O(log N) or O(1)? Interviews are never a one-way street.
There are boundless other "solutions", like stashing the whole thing into a class so that you can perform other optimisations (such as caching the information, or using a different data structure which makes the operation faster). Discussing these with your interviewer will give them a chance to see you in action.
As an aside, I tell my children to always show working out in their school assignments. If they just plop down the answer and it's wrong, they'll get nothing. However, if they show their working out and get the wrong answer, the teacher can at least see that they had the right idea (they probably just made one little mistake along the way).
It's exactly the same thing here. If you simply say "hashtable" and the interviewer has a different idea, that'll be a pretty short interview question.
However, saying "based on unsorted arrays, no possibility of keeping the data in a different data structure, and no limitations on the data values, it would appear hashtables are the most efficient way, BUT, if there was some other information I'm not yet privy to, there might be a better method" will show that you've given it some thought, and possibly open a dialogue with the interviewer that will help you out.
Bottom line, when an interviewer asks you a question, don't always assume it's as straightforward as you initially think. Apart from tech knowledge, they may be looking to see how you approach problem solving, how you handle Kobayashi-Maru-type problems, how you'll work in a team, how you'll treat difficult customers, whether you're a closet psychopath and endless other possibilities.
I am implementing a hash table for a project, using 3 different kinds of probing. Right now I'm working on linear.
For linear probing, I understand how the probing works, and my instructor implied he wanted the step size to be 1. The thing is, no duplicates are allowed. So I have to "search" for a value before I insert it, right? But what if the table is used to the point where all the cells are either "occupied" or "deleted"? Then in order to search for a specific key to make sure it isn't in the table, I'll have to search the entire table. That means a search operation (and by extension, an insert operation) is O(n).
That doesn't seem right, and I think I misunderstood something.
I know I won't run into the same issue with quadratic probing, since the table needs to be at least half empty and it will only probe a bounded number of cells. And for double hashing, I'm not sure how this will work, because I'll also need to search the table to prove that the key to be inserted isn't present. But how would I know when to stop the search if none of the cells are "never occupied"?
So: in open addressing, where every entry in the table has been occupied at some point, does it take O(n) probes to search for an element (and to insert, if no duplicates are allowed)?
If you misunderstand this aspect of linear probing, then so do I. I agree that if the hash table is nearly full, performance degrades towards O(n) per insertion. See Don Knuth's 1963 analysis for all the details.
Parenthetically, I was amazed to see that the first analysis of this problem was actually done by the mathematician Ramanujan in 1913, whose results implied "that the total displacement of elements, i.e., the construction cost, for a linear probing hashing table that is full is about N^(3/2)." (see here)
In practice, however, I don't think slow insertion is the important problem with nearly-full hash tables. The important problem is that you get to the point where you can't do another insertion at all!
Thus, any practical implementation of a hash table must have a strategy for re-sizing it once it goes beyond a given load factor, with the load factor that triggers re-sizing chosen based on either theory or experiment. Experiments are particularly valuable here because the performance of linear probing is very sensitive to how evenly the hash function spreads items across the table (avoiding clusters), which in turn makes performance very dependent on the characteristics of the items being inserted.
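For concreteness, here is a minimal sketch of that re-sizing strategy (the 0.7 threshold and the doubling policy are arbitrary illustrative choices, and deletion is left out):

```python
# Linear-probing set that doubles its table once the load factor passes 0.7,
# so probe sequences stay short and insertion never fails.
class LinearProbingSet:
    MAX_LOAD = 0.7

    def __init__(self, capacity=8):
        self.slots = [None] * capacity
        self.count = 0

    def _index(self, key):
        """First slot that is empty or already holds key."""
        i = hash(key) % len(self.slots)
        while self.slots[i] is not None and self.slots[i] != key:
            i = (i + 1) % len(self.slots)
        return i

    def add(self, key):
        if (self.count + 1) / len(self.slots) > self.MAX_LOAD:
            self._resize(2 * len(self.slots))
        i = self._index(key)
        if self.slots[i] is None:            # ignore duplicates
            self.slots[i] = key
            self.count += 1

    def _resize(self, new_capacity):
        old = [k for k in self.slots if k is not None]
        self.slots = [None] * new_capacity
        self.count = 0
        for k in old:                        # re-insert under the new size
            self.add(k)

    def __contains__(self, key):
        return self.slots[self._index(key)] == key
```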
I'm in the process of learning about simulated annealing algorithms and have a few questions on how I would modify an example algorithm to solve a 0-1 knapsack problem.
I found this great code on CP:
http://www.codeproject.com/KB/recipes/simulatedAnnealingTSP.aspx
I'm pretty sure I understand how it all works now (except the whole Boltzmann condition, which as far as I'm concerned is black magic, though I understand it's about escaping local optima and apparently this does exactly that). I'd like to re-design this to solve a 0-1 knapsack-"ish" problem. Basically I'm packing some of 5,000 objects into 10 sacks and need to optimize for the least unused space. The actual "score" I assign to a solution is a bit more complex, but not related to the algorithm.
This seems easy enough. It means the Anneal() function would be basically the same. I'd have to implement the GetNextArrangement() function to fit my needs. In the TSP problem, he just swaps two random nodes along the path (i.e., he makes a very small change each iteration).
For my problem, on the first iteration I'd pick 10 random objects and look at the leftover space. For the next iteration, would I just pick 10 new random objects? Or am I better off swapping out only a few of the objects, like half of them or even just one? Or maybe the number of objects I swap out should be relative to the temperature? Any of these seem doable to me; I'm just wondering if someone has advice on the best approach (though I can mess around with improvements once I have the code working).
Thanks!
Mike
With simulated annealing, you want to make neighbouring states as close in energy as possible. If the neighbours have significantly greater energy, the algorithm will just never jump to them without a very high temperature -- high enough that it will never make progress. On the other hand, if you can come up with heuristics that find lower-energy states, then exploit them.
For the TSP, this means swapping adjacent cities. For your problem, I'd suggest a conditional neighbour selection algorithm as follows:
If there are objects that fit in the empty space, then it always puts the biggest one in.
If no objects fit in the empty space, then pick an object to swap out -- but prefer to swap objects of similar sizes.
That is, objects are chosen with probability inversely proportional to the difference in their sizes. You might want to use something like roulette selection here, with the slice size being something like 1 / (size1 - size2)^2.
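A rough sketch of that roulette pick (the +1 in the denominator is just there to avoid dividing by zero when two sizes are equal):

```python
# Pick which packed object to swap out: objects whose size is close to the
# incoming object's size get a proportionally larger slice of the wheel.
import random

def pick_swap_victim(packed_sizes, incoming_size):
    weights = [1.0 / ((s - incoming_size) ** 2 + 1) for s in packed_sizes]
    return random.choices(range(len(packed_sizes)), weights=weights, k=1)[0]

print(pick_swap_victim([12, 30, 31, 90], 29))   # usually index 1 or 2
```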
Ah, I think I found my answer on Wikipedia. It suggests moving to a "neighbor" state, which usually implies changing as little as possible (like swapping two cities in a TSP problem).
From: http://en.wikipedia.org/wiki/Simulated_annealing
"The neighbours of a state are new states of the problem that are produced after altering the given state in some particular way. For example, in the traveling salesman problem, each state is typically defined as a particular permutation of the cities to be visited. The neighbours of some particular permutation are the permutations that are produced for example by interchanging a pair of adjacent cities. The action taken to alter the solution in order to find neighbouring solutions is called "move" and different "moves" give different neighbours. These moves usually result in minimal alterations of the solution, as the previous example depicts, in order to help an algorithm to optimize the solution to the maximum extent and also to retain the already optimum parts of the solution and affect only the suboptimum parts. In the previous example, the parts of the solution are the parts of the tour."
So I believe my GetNextArrangement function would want to swap out a random item in the solution with a random item not currently used.
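Something like this, perhaps (just a sketch of that move; the GetNextArrangement in the linked article works on a tour rather than a knapsack, so this is only the analogous idea):

```python
# Neighbour move: swap one randomly chosen packed item for one randomly chosen
# unused item, leaving the rest of the solution untouched.
import random

def next_arrangement(packed, unused):
    packed, unused = list(packed), list(unused)
    i = random.randrange(len(packed))
    j = random.randrange(len(unused))
    packed[i], unused[j] = unused[j], packed[i]
    return packed, unused

print(next_arrangement([3, 7, 42], [5, 8, 13, 21]))
```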