I found about Cuckoo Hash tables and they seem pretty good.
But most examples code I found implement this using 2 tables.
This seems to me wrong because the 2 tables may be in different memory pages and we have overhead of fetching random addresses and have no real locality.
Is it not possible to use 1 array instead of 2?
Is it perhaps not possible to detect when an element has already been kicked out 2 times and is time for resizing?
You can definitely do a cuckoo hashtable with a single hash table; that is, where the two positions for each object are simply positions within a single hash table.
The only small problem to be solved is how to decide during the cuckoo eviction loop which of the two positions to use for an evicted key. Of course, you can just try one position and use the other one if the first one was the same as the actual position. It should be possible to use SIMD to compute both hashes in parallel, so the cost of this strategy might be small.
However, if you want to guarantee a single hash computation during the cuckoo loop, there is a simple solution: instead of using H0(k) and H1(k) as the two positions, use H0(k) and H0(k) xor H1(k). (If H1 is independent from H0, then so is H0 xor H1, so the xor does not affect the distribution of hash values.) With this modification, you can always find "the other position" of k by xor'ing the current position with H1(k), so only a single hash computation is needed in the loop.
While that allows you to use a single hash table, and may even simplify the code, there is not a lot of evidence that it improves the operation of the algorithm. In my limited testing, it seems to increase the number of loop iterations by 40-50%. (Although it needs to be emphasized that in the vast majority of cases, a new key can be inserted into the table without entering the loop at all, so the increased number of loops is hardly noticeable in the actual execution time.)
To answer the confusion in the comments: no this is not language specific. If you're thinking about memory locality and want to ensure the two tables are close, then a single allocation is the way to go (however you allocate). In java this may look as follows:
class TwoTables {
private static final int SIZE_TABLE_FIRST = 11, SIZE_TABLE_SECOND = 29;
public TwoTables() {
m_buffer = new int[SIZE_TABLE_FIRST + SIZE_TABLE_SECOND];
}
// consider similar setters...
public int getFirst(int key) {
return m_buffer[toIndex(hashFirst(key), SIZE_TABLE_FIRST, 0)];
}
public int getSecond(int key) {
return m_buffer[toIndex(hashSecond(key), SIZE_TABLE_SECOND, SIZE_TABLE_FIRST)];
}
private static int toIndex(int hash, int mod, int offset) {
return hash % mod + offset;
}
private static int hashFirst(int key) { return ...; }
private static int hashSecond(int key) { return ...; }
private final int[] m_buffer;
}
If this performs better than accessing into two separate arrays is dependant on your JVM however: just think about the JIT being able to merge two small allocations into a single larger one on the fly - without you having to perform any index-magic.
Well, all forms of hashing are murder on caches.
Anyways you can easily combine the two into a single table. But then how do you tell whether you're on your first hash function or the second? The options are add that as metadata to every bucket, or else figure it out by running the first hash function, seeing whether you got the current location, and running the second only if you were on the first. That either requires extra space, or running more hash functions.
Splitting the table into 2 solves that problem more efficiently. And statistically you need the same number of buckets to store the same number of things whether or not the table has been split. So your whole hash table becomes smaller.
Yes.
http://www.spoj.com/problems/CUCKOO/
You can check this problem on spoj,we need to do this problem using a single hash table and two hash functions.
Related
For resolving hashing collision in the Hash Table data structure, we have one very popular strategy called Separate Chaining.
I'm aware, that in the Separate Chaining strategy, keys, which end up being collided into backing array's same index (due to the fact, that they're hashed into the same particular values), are Linked Lists.
I wonder whether the type of backing array is LinkedList<E>[] from the moment of creation of Hash Table (during separate chaining strategy implementation), or it's int[] and it gets converted to the LinkedList<E>[] array after first collision?
Because, having Linked Lists as each element of the backing array seems not the most optimal solution.. it means, that those Linked Lists, should be a list of the elements, which in turn, are Entries/Buckets of a pair of key-value.. and this all really consumes a lot of memory and resource, I reckon.
I did quite a research in different books and academic articles; yet, I still can't really get a clear answer on this.
Yes, separate chaining will cost more memory than probing or re-hashing. But the benefit is that you get more items in the hash table before performance begins to suffer. At some point you still have to re-index: typically when you realize that some bucket is over-represented or when the total number of occupied buckets exceeds some threshold.
Note that the backing array itself isn't a linked list. The backing array for a hash table that uses probing or re-hashing will probably be a dynamically-sized array of entries. Your entry would be something like:
class Entry {
String: key;
SomeObject: value;
}
If you're using separate chaining, the Entry object gets an additional field: a reference to the next item that hashed to the same bucket:
class Entry {
String: key;
SomeObject: value;
Entry: next;
}
The memory difference for the first item really isn't enough to worry about.
It's possible to write the code so that if a bucket has but a single item, it will contain just the key and value, and the bucket is converted to a linked list only on first collision. There is perhaps a small memory win there, and an even smaller performance gain. But the code is more complex and the gains aren't huge unless you know that the majority of your buckets won't have any collisions. Not worth the trouble of implementing, testing, and maintaining two different code paths.
I have a type that is used for updating the UI. This is the pseudo code for it.
{
Item[] AllItems;
long[] ItemsRemovedFromPreviousUpdateByIndex;
long[] ItemsAddedToPreviousUpdateByIndex;
}
These update are quite frequent, and contain a lot of data. We want to aggregate these instances and deliver just one update every 200ms to the UI. There may be 20 or 30 updates per 200ms window.
My question is whether there's a good way to aggregate the indices. I can't think of a way to do it without allocating a lot of memory.
I would try to store that data in a pair of ints, longs, or BigIntegers, depending on how many indices you have.
First think about doing something like this:
boolean[] added = new boolean[allItems.length];
boolean[] removed = new boolean[allItems.length];
Where the booleans are true if and only if the item at the corresponding index was not added/removed.
But then rather than doing this, I would try to represent the boolean array as a long, or if needed, a BigInteger. Since, tttfftfftf --> 1110010010 --> 914.
You can look into methods for flipping a single bit from 0 to 1, and if it's already a 1, leave it alone. You'll need to be clever in your init case, since you're starting out with all 0s (which is 0) and the first index might be large, so you would need to go from 0 to 10...0 in one step, which is just 1 << n. From there you are just flipping single bits, and possibly shifting, if you need to flip the bit at j > n where n was from the init-case mentioned above.
Keep in mind that you will always need to know the length of the allItems array at each reset, since 00001001 is the same as 1001, so if the length is 8, you should interpret 1001 as 00001001.
At some point we need to increase the size of hash, and normally we just rehash, which leads to re-constructure of the whole hash.
Is there any better solution so that when we increase the size, we don't need to re-construct the whole thing?
You could use http://en.wikipedia.org/wiki/Extendible_hashing, although AFAIK it is used mostly for on-disk databases.
There are also general methods for smoothing out some amortised costs. Starting points for this would be http://en.wikipedia.org/wiki/Static_and_dynamic_data_structures and http://en.wikipedia.org/wiki/Dynamization. One application of this to hash tables would be to always keep two tables, one of size N and one of size 2N or so. When the smaller overflows, start creating a table of size 4N, but don't populate it straight away - populate it incrementally while using the table of size 2N. By the time the table of size 2N is full, the table of size 4N should be ready. For the special case of hash tables, extendible hashing should be better.
Any time you re-hash, there's nothing that says you need to actually re-hash. In fact all that you actually need to do is re-mod (i.e. shift everything's position).
If you cache the hash (hehe, sounds like the start of a dr. seuss book) then you only need to compute it once. So store the hash along with the actual data, and that will save you from needing to calculate the hash again in the future. However I'm assuming that you're not already doing this, you didn't exactly explain the current process.
// Store these instead of the data directly. This assumes immutable data.
struct hashable_item
{
data dat;
int32 hash;
}
So I've been reading up on Hashing for my final exam, and I just cannot seem to grasp what is happening. Can someone explain Hashing to me the best way they understand it?
Sorry for the vague question, but I was hoping you guys would just be able to say "what hashing is" so I at least have a start, and if anyone knows any helpful ways to understand it, that would be helpful also.
Hashing is a fast heuristic for finding an object's equivalence class.
In smaller words:
Hashing is useful because it is computationally cheap. The cost is independent of the size of the equivalence class. http://en.wikipedia.org/wiki/Time_complexity#Constant_time
An equivalence class is a set of items that are equivalent. Think about string representations of numbers. You might say that "042", "42", "42.0", "84/2", "41.9..." are equivalent representations of the same underlying abstract concept. They would be in the same equivalence class. http://en.wikipedia.org/wiki/Equivalence_class
If I want to know whether "042" and "84/2" are probably equivalent, I can compute hashcodes for each (a cheap operation) and only if the hash codes are equal, then I try the more expensive check. If I want to divide representations of numbers into buckets, so that representations of the same number are in the buckets, I can choose bucket by hash code.
Hashing is heuristic, i.e. it does not always produce a perfect result, but its imperfections can be mitigated for by an algorithm designer who is aware of them. Hashing produces a hash code. Two different objects (not in the same equivalence class) can produce the same hash code but usually don't, but two objects in the same equivalence class must produce the same hash code. http://en.wikipedia.org/wiki/Heuristic#Computer_science
Hashing is summarizing.
The hash of the sequence of numbers (2,3,4,5,6) is a summary of those numbers. 20, for example, is one kind of summary that doesn't include all available bits in the original data very well. It isn't a very good summary, but it's a summary.
When the value involves more than a few bytes of data, some bits must get rejected. If you use sum and mod (to keep the sum under 2billion, for example) you tend to keep a lot of right-most bits and lose all the left-most bits.
So a good hash is fair -- it keeps and rejects bits equitably. That tends to prevent collisions.
Our simplistic "sum hash", for example, will have collisions between other sequences of numbers that also happen to have the same sum.
Firstly we should say about the problem to be solved with Hashing algorithm.
Suppose you have some data (maybe an array, or tree, or database entries). You want to find concrete element in this datastore (for example in array) as much as faster. How to do it?
When you are built this datastore, you can calculate for every item you put special value (it named HashValue). The way to calculate this value may be different. But all methods should satisfy special condition: calculated value should be unique for every item.
So, now you have an array of items and for every item you have this HashValue. How to use it? Consider you have an array of N elements. Let's put your items to this array according to their HashHalues.
Suppose, you are to answer for this question: Is the item "it1" exists in this array? To answer to it you can simply find the HashValue for "it1" (let's call it f("it1")) and look to the Array at the f("it1") position. If the element at this position is not null (and equals to our "it1" item), our answer is true. Otherwise answer is false.
Also there exist collisions problem: how to find such coolest function, which will give unique HashValues for all different elements. Actually, such function doesn't exist. There are a lot of good functions, which can give you good values.
Some example for better understanding:
Suppose, you have an array of Strings: A = {"aaa","bgb","eccc","dddsp",...}. And you are to answer for the question: does this array contain String S?
Firstle, we are to choose function for calculating HashValues. Let's take the function f, which has this meaning - for a given string it returns the length of this string (actually, it's very bad function. But I took it for easy understanding).
So, f("aaa") = 3, f("qwerty") = 6, and so on...
So now we are to calculate HashValues for every element in array A: f("aaa")=3, f("eccc")=4,...
Let's take an array for holding this items (it also named HashTable) - let's call it H (an array of strings). So, now we put our elements to this array according to their HashValues:
H[3] = "aaa", H[4] = "eccc",...
And finally, how to find given String in this array?
Suppose, you are given a String s = "eccc". f("eccc") = 4. So, if H[4] == "eccc", our answer will be true, otherwise it fill be false.
But how to avoid situations, when to elements has same HashValues? There are a lot of ways to it. One of this: each element in HashTable will contain a list of items. So, H[4] will contain all items, which HashValue equals to 4. And How to find concrete element? It's very easy: calculate fo this item HashValue and look to the list of items in HashTable[HashValue]. If one of this items equals to our searching element, answer is true, owherwise answer is false.
You take some data and deterministically, one-way calculate some fixed-length data from it that totally changes when you change the input a little bit.
a hash function applied to some data generates some new data.
it is always the same for the same data.
thats about it.
another constraint that is often put on it, which i think is not really true, is that the hash function requires that you cannot conclude to the original data from the hash.
for me this is an own category called cryptographic or one way hashing.
there are a lot of demands on certain kinds of hash f unctions
for example that the hash is always the same length.
or that hashes are distributet randomly for any given sequence of input data.
the only important point is that its deterministic (always the same hash for the same data).
so you can use it for eample verify data integrity, validate passwords, etc.
read all about it here
http://en.wikipedia.org/wiki/Hash_function
You should read the wikipedia article first. Then come with questions on the topics you don't understand.
To put it short, quoting the article, to hash means:
to chop and mix
That is, given a value, you get another (usually) shorter value from it (chop), but that obtained value should change even if a small part of the original value changes (mix).
Lets take x % 9 as an example hashing function.
345 % 9 = 3
355 % 9 = 4
344 % 9 = 2
2345 % 9 = 5
You can see that this hashing method takes into account all parts of the input and changes if any of the digits change. That makes it a good hashing function.
On the other hand if we would take x%10. We would get
345 % 10 = 5
355 % 10 = 5
344 % 10 = 4
2345 % 10 = 5
As you can see most of the hashed values are 5. This tells us that x%10 is a worse hashing function than x%9.
Note that x%10 is still a hashing function. The identity function could be considered a hash function as well.
I'd say linut's answer is pretty good, but I'll amplify it a little. Computers are very good at accessing things in arrays. If I know that an item is in MyArray[19], I can access it directly. A hash function is a means of mapping lookup keys to array subscripts. If I have 193,372 different strings stored in an array, and I have a function which will return 0 for one of the strings, 1 for another, 2 for another, etc. up to 193,371 for the last one, I can see if any given string is in the array by running that function and then seeing if the given string matches the one in that spot in the array. Nice and easy.
Unfortunately, in practice, things are seldom so nice and tidy. While it's often possible to write a function which will map inputs to unique integers in a nice easy range (if nothing else:
if (inputstring == thefirststring) return 0;
if (inputstring == thesecondstring) return 1;
if (inputstring == thethirdstring) return 1;
... up to the the193371ndstring
in many cases, a 'perfect' function would take so much effort to compute that it wouldn't be worth the effort.
What is done instead is to design a system where a hash function says where one should start looking for the data, and then some other means is used to search for the data from there. A few common approaches are:
Linear hashing -- If two items map to the same hash value, store one of them in the array slot following the one indicated by the hash code. When looking for an item, search in the indicated slot, and then next one, then the next, etc. until the item is found or one hits an empty slot. Linear hashing is simple, but it works poorly unless the table is much bigger than the number of items in it (leaving lots of empty slots). Note also that deleting items from such a hash table can be difficult, since the existence of an item may have prevented some other item from going into its indicated spot.
Double hashing -- If two items map to the same value, compute a different hash value for the second one added, and shove the second item that many slots away (if that slot is full, keep stepping by that increment until a vacant slot is found). If the hash values are independent, this approach can work well with a more-dense table. It's even harder to delete items from such a table, though, than with a linear hash table, since there's no nice way to find items which were displaced by the item to be deleted.
Nested hashing -- Each slot in the hash table contains a hash table using a different function from the main table. This can work well if the two hash functions are independent, but is apt to work very poorly if they aren't.
Chain-bucket hashing -- Each slot in the hash table holds a list of things that map to that hash value. If N things map to a particular slot, finding one of them will take time O(N). If the hash function is decent, however, most non-empty slots will contain only one item, most of those with more than that will contain only two, etc. so no slot will hold very many items.
When dealing with a fixed data set (e.g. a compiler's set of keywords), linear hashing is often good; in cases where it works badly, one can tweak the hash function so it will work well. When dealing with an unknown data set, chain bucket hashing is often the best approach. The overhead of dealing with extra lists may make it more expensive than double hashing, but it's far less likely to perform really horribly.
When a user adds a new item in my system, I want to produce a unique non-incrementing pseudo-random 7-digit code for that item. The number of items created will only number in the thousands (<10,000).
Because it needs to be unique and no two items will have the same information, I could use a hash, but it needs to be a code they can share with other people - hence the 7 digits.
My original thought was just to loop the generation of a random number, check that it wasn't already used, and if it was, rinse and repeat. I think this is a reasonable if distasteful solution given the low likelihood of collisions.
Responses to this question suggest generating a list of all unused numbers and shuffling them. I could probably keep a list like this in a database, but we're talking 10,000,000 entries for something relatively infrequent.
Does anyone have a better way?
Pick a 7-digit prime number A, and a big prime number B, and
int nth_unique_7_digit_code(int n) {
return (n * B) % A;
}
The count of all unique codes generated by this will be A.
If you want to be more "secure", do pow(some_prime_number, n) % A, i.e.
static int current_code = B;
int get_next_unique_code() {
current_code = (B * current_code) % A;
return current_code;
}
You could use an incrementing ID and then XOR it on some fixed key.
const int XORCode = 12345;
private int Encode(int id)
{
return id^XORCode;
}
private int Decode(int code)
{
return code^XORCode;
}
Honestly, if you want to generate only a couple of thousand 7-digit codes, while 10 million different codes will be available, I think just generating a random one and checking for a collision is good enough.
The chance of a collision on the first hit will be, in the worst case scenario, about 1 in a thousand, and the computational effort to just generate a new 7-digit code and check for a collision again will be much smaller than keeping a dictionary, or similar solutions.
Using a GUID instead of a 7-digit code as harryovers suggested will also certainly work, but of course a GUID will be slightly harder to remember for your users.
i would suggest using a guid instead of a 7 digit code as it will be more unique and you don't have to worry about generateing them as .NET will do this for you.
All solutions for a "unique" ID must have a database somewhere: Either one which contains the used IDs or one with the free IDs. As you noticed, the database with free IDs will be pretty big so most often, people use a "used IDs" database and check for collisions.
That said, some databases offer a "random ID" generator/sequence which already returns IDs in a range in random order.
This works by using a random number generator which can create all numbers in a range without repeating itself plus the feature that you can save it's state somewhere. So what you do is run the generator once, use the ID and save the new state. For the next run, you load the state and reset the generator to the last state to get the next random ID.
I assume you'll have a table of the generated ones. In that case, I don't see a problem with picking random numbers and checking them against the database, but I wouldn't do it individually. Generating them is cheap, doing the DB query is expensive relative to that. I'd generate 100 or 1,000 at a time and then ask the DB which of those exists. Bet you won't have to do it twice most of the time.
You have <10.000 items, so you need only 4 digits to store a unique number for all items.
Since you have 7 digits, you have 3 digits extra.
If you combine a unique sequence number of 4 digits with a random number of 3 digits, you will be unique and random. You increment the sequence number with every new ID you generate.
You can just append them in any order, or mix them.
seq = abcd,
rnd = ABC
You can create the following ID's:
abcdABC
ABCabcd
aAbBcCd
If you use only one mixing algorithm, you will have unique numbers, that look random.
I would try to use an LFSR (Linear feedback shift register) the code is really simple you can find examples everywhere ie Wikipedia and even though it's not cryptographically secure it looks very random. Also the implementation will be very fast since it's using mainly shift operations.
With only thousands of items in the database, your original idea seems sound. Checking the existance of a value in a sorted (indexed) list of a few tens of thousands of items would only require a few data fetches and comparisons.
Pre-generating the list doesn't sound like a good idea, because you will either store way more numbers than are necessary, or you will have to deal with running out of them.
Probability of having hits is very low.
For instance - you have 10^4 users and 10^7 possible IDs.
Probability that you pick used ID 10 times in row is now 10^-30.
This chance is lower than once in a lifetime of any person.
Well, you could ask the user to pick their own 7-digit number and validate it against the population of existing numbers (which you would have stored as they were used up), but I suspect you would be filtering a lot of 1234567, 7654321, 9999999, 7777777 type responses and might need a few RegExs to achieve the filtering, plus you'd have to warn the user against such sequences in order not to have a bad, repetitive, user input experience.