Are there any hash functions that allow you to resize the table without also rehashing (removing + reinserting) the contents?

Is it possible using a certain hash function and method (the division method, or double hashing) to make a chained hash table that can be resized without having to reinsert (rehash) each element already in the table?

You would still need to reinsert, but one way to make that cheaper is to store each element's full hash value, as computed before the modulus was applied. Then a resize only has to redo the modulus, which saves a large part of the calculation cost of rehashing.
With this approach, it is possible to shrink the table as well.
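A minimal sketch of that idea, assuming a chained table in Python. The names `Entry`, `CachedHashTable`, and the `hash_calls` counter are hypothetical and exist only to show that `resize` never calls the hash function again:

```python
class Entry:
    """One chained entry; full_hash is the hash before any modulus is applied."""
    __slots__ = ("key", "value", "full_hash")

    def __init__(self, key, value, full_hash):
        self.key = key
        self.value = value
        self.full_hash = full_hash


class CachedHashTable:
    """Hypothetical chained table that caches each entry's full hash."""

    def __init__(self, size=8):
        self.buckets = [[] for _ in range(size)]
        self.count = 0
        self.hash_calls = 0  # illustration only: counts hash computations

    def _hash(self, key):
        self.hash_calls += 1
        return hash(key) & 0xFFFFFFFF  # assumption: a 32-bit non-negative hash

    def put(self, key, value):
        h = self._hash(key)
        bucket = self.buckets[h % len(self.buckets)]
        for e in bucket:
            if e.full_hash == h and e.key == key:
                e.value = value
                return
        bucket.append(Entry(key, value, h))
        self.count += 1

    def get(self, key, default=None):
        h = self._hash(key)
        for e in self.buckets[h % len(self.buckets)]:
            if e.full_hash == h and e.key == key:
                return e.value
        return default

    def resize(self, new_size):
        """Grow or shrink: every entry is still moved, but only re-modded."""
        new_buckets = [[] for _ in range(new_size)]
        for bucket in self.buckets:
            for e in bucket:
                new_buckets[e.full_hash % new_size].append(e)
        self.buckets = new_buckets
```

Comparing `full_hash` before comparing keys is a side benefit of caching: most non-matching entries in a chain are rejected without a full key comparison.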

I can only assume the reason you want to avoid rehashing everything is that the resulting high-latency operation is not an issue for throughput but is instead a problem for responsiveness (either human or in the SLA sense).
In theory you could use a modified closed addressing hash table like so:
Remember all previous sizes at which elements were added.
On resize, keep the old buckets around, linked to internally via a map of sizeWhenUsed -> buckets (obviously, if the buckets are empty there is no need to bother).
Invariant: a mapping for key k exists in only one of the 'internal hash tables' at any time.
On addition of a value you must first look it up in all the other maps to determine whether the entry already exists and is mapped. If it is, remove it from the old one and add it to the new one.
If an internal map becomes empty or falls below a certain size, it should be deleted and any remaining elements moved into the current hash table.
So long as the number of internal hashes is bounded by a constant, this will not impact the big O behaviour of the data structure in time, though it will in memory.
It will, however, affect the actual performance, as X additional checks must be made, where X is the number of old hashes maintained.
If the wasted space of the list of buckets (the buckets themselves will be null if empty, so they cost nothing unless populated) becomes significant (use a fudge factor for this), then at some point on a rehash you may have to take the hit of moving things into the current table, unless you are willing to expend essentially unlimited memory.
Downgrades in size of the hash will only function in the desired manner (releasing memory) if you are willing to rehash. This is unavoidable.
It is possible you could make use of some complex additional data within an open addressing scheme to 'flag' which of the internal hashes a cell is in use by, but removals would be extremely complex to get right and would be very expensive unless you just left them as wasted space. I would never attempt this.
I would not suggest using the former method either, unless the underlying data spends very little time in the hash, so that the related churn tends to steadily 'erase' the older-sized hashes. Even then, a hash tuned for just this sort of behaviour and preset with an appropriate size would likely perform much better.
The above scheme simply trades wasted memory and throughput for fewer expensive operations, with a speculative (at best) chance of reducing that waste. I would therefore suggest that pre-sizing your hash to be larger than required, so that it is never resized, is the more sensible option.
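To make the bookkeeping concrete, here is a rough sketch of the multi-table scheme described above, in Python. It is an illustration under assumptions, not a recommended design: the class name `GenerationalTable`, the `max_old` cap, and the policy of draining the oldest table once the cap is exceeded are all hypothetical choices.

```python
class GenerationalTable:
    """Chained table that keeps older, smaller tables instead of rehashing them."""

    def __init__(self, size=8, max_old=3):
        self.current = [[] for _ in range(size)]
        self.current_count = 0
        self.old = []  # older generations, newest first: [buckets, live_count]
        self.max_old = max_old

    @staticmethod
    def _bucket(buckets, key):
        return buckets[hash(key) % len(buckets)]

    def _remove_from_old(self, key):
        # Invariant: a key lives in at most one table, so stop at the first hit.
        for gen in self.old:
            bucket = self._bucket(gen[0], key)
            for i, (k, _) in enumerate(bucket):
                if k == key:
                    del bucket[i]
                    gen[1] -= 1
                    # Drop generations that churn has emptied.
                    self.old = [g for g in self.old if g[1] > 0]
                    return

    def put(self, key, value):
        self._remove_from_old(key)  # the X extra checks mentioned above
        bucket = self._bucket(self.current, key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)
                return
        bucket.append((key, value))
        self.current_count += 1
        if self.current_count > len(self.current):
            self._grow()

    def _grow(self):
        # "Resize" without touching existing entries: shelve the current table.
        self.old.insert(0, [self.current, self.current_count])
        self.current = [[] for _ in range(2 * len(self.old[0][0]))]
        self.current_count = 0
        if len(self.old) > self.max_old:
            # Too many generations: take the hit and drain the oldest one.
            buckets, _ = self.old.pop()
            for b in buckets:
                for k, v in b:
                    self._bucket(self.current, k).append((k, v))
                    self.current_count += 1

    def get(self, key, default=None):
        for buckets in [self.current] + [g[0] for g in self.old]:
            for k, v in self._bucket(buckets, key):
                if k == key:
                    return v
        return default
```

Every `put` and every unsuccessful `get` pays for a probe in each retained generation, which is the throughput cost the answer warns about.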

Probably not - the hash would have to not use any variety of modulus, which would mean that it would have a required table size depending on the data anyway.

All hash tables must deal with collisions, whether through chaining, probing, or something else. So I suspect that if, upon resize, you simply resized the table (i.e. you didn't re-insert everything), you would have a functional, though highly suboptimal, hash table.

I assume you're asking this question because you want to avoid the high cost of resizing a hash table. You want a hash table which has guaranteed constant time (assuming no collision problems, of course). This can be done.
The trick is to iteratively initialize the next-size hash table while the current one is filling up. By the time you need it, it's ready.
Quick pseudo-code to add an element:
if resizing then
    smallTable = bigTable
    bigTable = new T[smallTable.length * 2] //if allocation zeroes memory, we lose O(1)
    set state to zeroing
elseif zeroing then
    zero a small amount of the bigTable memory
    if done zeroing then set state to transferring
elseif transferring then
    transfer a few values in the small table to the big table
    if done transferring then set state to resizing
end if
add new item to small array
add new item to large array
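The same spreading-out of work can be sketched in runnable form. The Python below is a variant, not a transcription of the pseudo-code: instead of writing each new item to both tables, it keeps the old table around after growing and moves a few old buckets into the new table on every insertion, while lookups check both. The names `IncrementalTable`, `step`, and `cursor` are hypothetical.

```python
class IncrementalTable:
    """Chained hash table that migrates a few old buckets per insertion."""

    def __init__(self, size=4, step=2):
        self.new = [[] for _ in range(size)]
        self.old = None   # previous table while a migration is in progress
        self.cursor = 0   # index of the next old bucket to migrate
        self.count = 0
        self.step = step  # old buckets moved per put()

    def _migrate_some(self):
        moved = 0
        while self.old is not None and moved < self.step:
            for k, v in self.old[self.cursor]:
                self.new[hash(k) % len(self.new)].append((k, v))
            self.old[self.cursor] = []
            self.cursor += 1
            moved += 1
            if self.cursor == len(self.old):
                self.old = None  # migration finished

    def put(self, key, value):
        self._migrate_some()
        if self.old is None and self.count >= len(self.new):
            # Caveat from the pseudo-code applies: this allocation is O(n) in
            # Python, so only the copying of entries is spread out here.
            self.old, self.new = self.new, [[] for _ in range(2 * len(self.new))]
            self.cursor = 0
        # Remove any existing mapping so the key lives in exactly one table.
        for table in (self.old, self.new):
            if table is None:
                continue
            bucket = table[hash(key) % len(table)]
            for i, (k, _) in enumerate(bucket):
                if k == key:
                    del bucket[i]
                    self.count -= 1
                    break
        self.new[hash(key) % len(self.new)].append((key, value))
        self.count += 1

    def get(self, key, default=None):
        for table in (self.new, self.old):
            if table is None:
                continue
            for k, v in table[hash(key) % len(table)]:
                if k == key:
                    return v
        return default
```

With `step=2` and a doubling table, the old table of n buckets is emptied after about n/2 insertions, well before the new table of 2n buckets reaches the point where it would need to grow again.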

Related

When to use hash tables?

In which cases can using a hash table improve performance, and in which cases does it not? And in which cases are hash tables not applicable?
In which cases can using a hash table improve performance, and in which cases does it not?
If you have reason to care, implement using hash tables and whatever else you're considering, put your actual data through, and measure which performs better.
That said, if the hash table has the operations you need (i.e. you're not expecting to iterate it in sorted order, or compare it quickly to another hash table), and has millions or more (billions, trillions...) of elements, then it'll probably be your best choice. But a lot depends on the hash table implementation (especially the choice of closed vs. open hashing), object size, hash function quality and calculation cost (runtime), comparison cost, oddities of your computer's memory performance at different cache levels... in short: too many things to make even an educated guess a better choice than measuring, when it matters.
And in which cases are hash tables not applicable?
Mainly when:
The input can't be hashed (e.g. you're given binary blobs and don't know which bits in there are significant, but you do have an int cmp(const T&, const T&) function you could use for a std::map), or
the available/possible hash functions are very collision prone, or
you want to avoid worst-case performance hits for:
handling lots of hash-colliding elements (perhaps "engineered" by someone trying to crash or slow down your software)
resizing the hash table: unless presized to be large enough (which can be wasteful and slow when excessive memory is used), the majority of implementations will outgrow the arrays they're using for the hash table every now and then, then allocate a bigger array and copy the content across. This can make the specific insertions that trigger the rehashing much slower than the normal O(1) behaviour, even though the average is still O(1); if you need more consistent behaviour in all cases, something like a balanced binary tree may serve
your access patterns are quite specialised (e.g. frequently operating on elements with keys that are "nearby" in some specific sort order), such that cache efficiency is better for other storage models that keep them nearby in memory (e.g. bucket sorted elements), even if you're not exactly relying on the sort order for e.g. iteration
We use hash tables to get O(1) access time. Imagine a dictionary. When you are looking for a word, e.g. "happy", you jump straight to 'H'. Here the hash function is determined by the starting letter. You then look for "happy" within the H bucket (actually the H bucket, then the HA bucket, then the HAP bucket, and so on).
It doesn't make sense to use hash tables when your data is ordered or needs ordering, like sorted numbers. (Letters are ordered ABCD....XYZ, but it wouldn't matter if you switched A and Z, provided you know they are switched in your dictionary.)

Resizing tradeoffs when implementing a hashtable using linear probing

I am trying to implement a hashtable using linear probing.
Before inserting a (key, value) pair into the hashtable, I want to check if it's half full. If it is, I need to double the size of the underlying array.
Obviously, there are two ways to do that:
One is to create another array with the doubled size, rehash all entries in the old one and add them to the new array. Then, rebind the old array to the new one. This way is easy to implement but uses a lot of space.
The other one is to double the array and do the rehashing in-place. It seems that this way may lead to longer running time because rehashing may cause collisions with both newly hashed slots and old slots.
Which way should I use?
Your second solution only saves space during the resize process if there is in fact room to expand the existing hash table in-place - I think the chances of that being the case for a large hash table are quite slim, so I would just go for your first solution.
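For reference, the first approach (allocate a doubled array, reinsert everything, then swap) might look like the following Python sketch. The class and method names are hypothetical, and deletion is left out to keep the probing logic short.

```python
class LinearProbingTable:
    """Open-addressing table that doubles once it is half full."""

    _EMPTY = object()  # sentinel marking an unused slot

    def __init__(self, capacity=8):
        self.slots = [self._EMPTY] * capacity
        self.count = 0

    def _probe(self, slots, key):
        """Return the index holding `key`, or the first empty slot on its probe path."""
        i = hash(key) % len(slots)
        while slots[i] is not self._EMPTY and slots[i][0] != key:
            i = (i + 1) % len(slots)
        return i

    def put(self, key, value):
        if 2 * self.count >= len(self.slots):  # half full: grow before inserting
            self._grow()
        i = self._probe(self.slots, key)
        if self.slots[i] is self._EMPTY:
            self.count += 1
        self.slots[i] = (key, value)

    def _grow(self):
        # Both arrays are alive at once here, which is the extra space
        # the question is worried about.
        new_slots = [self._EMPTY] * (2 * len(self.slots))
        for slot in self.slots:
            if slot is not self._EMPTY:
                new_slots[self._probe(new_slots, slot[0])] = slot
        self.slots = new_slots  # rebind; the old array becomes garbage

    def get(self, key, default=None):
        slot = self.slots[self._probe(self.slots, key)]
        return default if slot is self._EMPTY else slot[1]
```

Because the table is never more than half full, every probe sequence is guaranteed to reach an empty slot, so `_probe` always terminates.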

How to improve the performance when doing rehash?

At some point we need to increase the size of the hash table, and normally we just rehash, which means reconstructing the whole table.
Is there any better solution so that when we increase the size, we don't need to re-construct the whole thing?
You could use extendible hashing (http://en.wikipedia.org/wiki/Extendible_hashing), although AFAIK it is used mostly for on-disk databases.
There are also general methods for smoothing out some amortised costs. Starting points for this would be http://en.wikipedia.org/wiki/Static_and_dynamic_data_structures and http://en.wikipedia.org/wiki/Dynamization. One application of this to hash tables would be to always keep two tables, one of size N and one of size 2N or so. When the smaller overflows, start creating a table of size 4N, but don't populate it straight away - populate it incrementally while using the table of size 2N. By the time the table of size 2N is full, the table of size 4N should be ready. For the special case of hash tables, extendible hashing should be better.
Any time you re-hash, there's nothing that says you need to actually recompute the hashes. In fact, all that you actually need to do is re-mod (i.e. shift every item to its position under the new table size).
If you cache the hash (hehe, sounds like the start of a Dr. Seuss book) then you only need to compute it once. So store the hash along with the actual data, and that will save you from needing to calculate the hash again in the future. However, I'm assuming that you're not already doing this; you didn't exactly explain the current process.
// Store these instead of the data directly. This assumes immutable data.
#include <stdint.h>

struct hashable_item
{
    data dat;       // the stored element; `data` stands in for your element type
    uint32_t hash;  // full hash, computed once, before any modulus is applied
};
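When the table size is a power of two and doubles on resize, the re-mod step becomes especially cheap: an item in bucket i of an n-bucket table can only end up in bucket i or bucket i + n of the 2n-bucket table, and one bit of the cached hash decides which. A small Python sketch, assuming chains of `(full_hash, key, value)` tuples with non-negative hashes (the function name `split_on_double` is hypothetical):

```python
def split_on_double(buckets):
    """Double a power-of-two chained table using only the cached hashes.

    Each chain holds (full_hash, key, value) tuples, and bucket i holds
    exactly the items with full_hash & (n - 1) == i.
    """
    n = len(buckets)
    assert n > 0 and n & (n - 1) == 0, "table size must be a power of two"
    new_buckets = [[] for _ in range(2 * n)]
    for i in range(n):
        for item in buckets[i]:
            # Bit `n` of the hash is the one new bit that the larger mask exposes.
            target = i + n if item[0] & n else i
            new_buckets[target].append(item)
    return new_buckets
```

No key is hashed or compared during the split, and each old chain can be processed independently of the others.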

Hash table optimized for full iteration + key replacement

I have a hash table where the vast majority of accesses at run-time follow one of the following patterns:
Iterate through all key/value pairs. (The speed of this operation is critical.)
Modify keys (i.e. remove a key/value pair & add another with the same value but a different key. Detect duplicate keys & combine values if necessary.) This is done in a loop, affecting many thousands of keys, but with no other operations intervening.
I would also like it to consume as little memory as possible.
Other standard operations must be available, though they are used less frequently, e.g.
Insert a new key/value pair
Given a key, look up the corresponding value
Change the value associated with an existing key
Of course all "standard" hash table implementations, including standard libraries of most high-level-languages, have all of these capabilities. What I am looking for is an implementation that is optimized for the operations in the first list.
Issues with common implementations:
Most hash table implementations use separate chaining (i.e. a linked list for each bucket.) This works but I am hoping for something that occupies less memory with better locality of reference. Note: my keys are small (13 bytes each, padded to 16 bytes.)
Most open addressing schemes have a major disadvantage for my application: Keys are removed and replaced in large groups. That leaves deletion markers that increase the load factor, requiring the table to be re-built frequently.
Schemes that work, but are less than ideal:
Separate chaining with an array (instead of a linked list) per bucket:
Poor locality of reference, resulting from memory fragmentation as small arrays are reallocated many times
Linear probing/quadratic hashing/double hashing (with or without Brent's Variation):
Table quickly fills up with deletion markers
Cuckoo hashing
Only works for <50% load factor, and I want a high LF to save memory and speed up iteration.
Is there a specialized hashing scheme that would work well for this case?
Note: I have a good hash function that works well with both power-of-2 and prime table sizes, and can be used for double hashing, so this shouldn't be an issue.
Would extendible hashing help? Iterating through the keys by walking the 'directory' should be fast. Not sure if the "modify key for value" operation is any better with this scheme or not.
Based on how you're accessing the data, does it really make sense to use a hash table at all?
Since your main use cases involve iteration, a sorted list or a B-tree might be a better data structure.
It doesn't seem like you really need the constant-time random data access a hash table is built for.
You can do much better than a 50% load factor with cuckoo hashing.
Two hash functions with four items per bucket will get you over 90% with little effort. See this paper:
http://www.ru.is/faculty/ulfar/CuckooHash.pdf
I'm building a pre-computed dictionary using a cuckoo hash and getting a load factor of better than 99% with two hash functions and seven items per bucket.
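As a rough illustration of what "two hash functions with several items per bucket" means, here is a Python sketch of bucketed cuckoo hashing with random eviction. It is not taken from the linked paper; the class name, the two salt constants, the `max_kicks` limit, and the convention of returning the displaced item on failure are all assumptions made for the example.

```python
import random


class BucketedCuckoo:
    """Cuckoo hash: two candidate buckets per key, several slots per bucket."""

    def __init__(self, num_buckets, slots_per_bucket=4, max_kicks=500, seed=0):
        self.buckets = [[] for _ in range(num_buckets)]
        self.slots = slots_per_bucket
        self.max_kicks = max_kicks
        self.rng = random.Random(seed)

    def _positions(self, key):
        # Assumption: salting Python's hash() stands in for two independent hashes.
        n = len(self.buckets)
        return hash((0x9E3779B9, key)) % n, hash((0x85EBCA6B, key)) % n

    def get(self, key, default=None):
        for p in self._positions(key):
            for k, v in self.buckets[p]:
                if k == key:
                    return v
        return default

    def put(self, key, value):
        """Return None on success, else the (key, value) left without a slot."""
        for p in self._positions(key):
            bucket = self.buckets[p]
            for i, (k, _) in enumerate(bucket):
                if k == key:
                    bucket[i] = (key, value)
                    return None
        item = (key, value)
        for _ in range(self.max_kicks):
            p1, p2 = self._positions(item[0])
            for p in (p1, p2):
                if len(self.buckets[p]) < self.slots:
                    self.buckets[p].append(item)
                    return None
            # Both buckets are full: evict a random occupant and re-place it.
            p = self.rng.choice((p1, p2))
            j = self.rng.randrange(self.slots)
            self.buckets[p][j], item = item, self.buckets[p][j]
        return item  # caller should rebuild with new hashes or a larger table
```

A lookup touches at most two buckets regardless of load, and iteration is a linear walk over the bucket array, which matters for the access pattern in the question.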

Caching vector addition over changing collections

I have the following setup:
I have a largish number of uuids (currently about 10k but expected to grow unboundedly - they're user IDs) and a function f : id -> sparse vector with 32-bit integer values (no need to worry about precision). The function is reasonably expensive (not outrageously so, but probably on the order of a few 100ms for a given id). The dimension of the sparse vectors should be assumed to be infinite, as new dimensions can appear over time, but in practice is unlikely to ever exceed about 20k (and individual results of f are unlikely to have more than a few hundred non-zero values).
I want to support the following operations efficiently:
add a new ID to the collection
invalidate an existing ID
retrieve sum f(id) in O(changes since last retrieval)
i.e. I want to cache the sum of the vectors in a way that's reasonable to do incrementally.
One option would be to support a remove ID operation and treat invalidation as a remove followed by an add. The problem with this is that it requires us to keep track of all the old values of f, which is expensive in space. I potentially need to use many instances of this sort of cached structure, so I would like to avoid that.
The likely usage pattern is that new IDs are added at a fairly continuous rate and are frequently invalidated at first. IDs which have been invalidated recently are much more likely to be invalidated again than ones which have remained valid for a long time, but in principle an old ID can still be invalidated.
Ideally I don't want to do this in memory (or at least I want a way that lets me save the result to disk efficiently), so an idea which lets me piggyback off an existing DB implementation of some sort would be especially appreciated.
