How to improve performance when rehashing?

At some point we need to increase the size of a hash table, and normally we just rehash, which leads to reconstructing the whole table.
Is there any better solution, so that when we increase the size we don't need to reconstruct the whole thing?

You could use http://en.wikipedia.org/wiki/Extendible_hashing, although AFAIK it is used mostly for on-disk databases.
There are also general methods for smoothing out some amortised costs. Starting points for this would be http://en.wikipedia.org/wiki/Static_and_dynamic_data_structures and http://en.wikipedia.org/wiki/Dynamization. One application of this to hash tables would be to always keep two tables, one of size N and one of size 2N or so. When the smaller overflows, start creating a table of size 4N, but don't populate it straight away - populate it incrementally while using the table of size 2N. By the time the table of size 2N is full, the table of size 4N should be ready. For the special case of hash tables, extendible hashing should be better.
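Not from the answer above, but a minimal C++ sketch may help make that two-table idea concrete (the class name, the 0.7 load-factor trigger and the "migrate four entries per insert" pace are all arbitrary assumptions, and deletion is omitted):

// Minimal sketch of incremental rehashing with two tables. While a migration
// is in flight, inserts go to the bigger table, a few old entries are moved
// across on every insert, and lookups consult both tables.
#include <cstddef>
#include <string>
#include <unordered_map>

class incremental_map {
    std::unordered_map<int, std::string> current_, next_;
    bool migrating_ = false;

    void migrate_some(std::size_t n) {
        while (migrating_ && n-- > 0 && !current_.empty()) {
            auto it = current_.begin();
            next_.insert(*it);      // won't overwrite a newer value already in next_
            current_.erase(it);
        }
        if (migrating_ && current_.empty()) {   // migration finished
            current_.swap(next_);
            next_.clear();
            migrating_ = false;
        }
    }

public:
    void insert(int key, std::string value) {
        if (!migrating_ && current_.load_factor() > 0.7f) {
            next_.reserve(current_.size() * 2); // pre-size the bigger table
            migrating_ = true;
        }
        (migrating_ ? next_ : current_)[key] = std::move(value);
        migrate_some(4);                        // spread the migration cost over inserts
    }

    const std::string* find(int key) const {
        if (auto it = next_.find(key); it != next_.end()) return &it->second;
        if (auto it = current_.find(key); it != current_.end()) return &it->second;
        return nullptr;
    }
};

Redis's dict uses essentially this incremental-rehash pattern in production, so the approach is known to work at scale.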

Any time you re-hash, there's nothing that says you need to actually re-compute the hashes. In fact, all you actually need to do is re-mod (i.e. shift everything's position).
If you cache the hash (hehe, sounds like the start of a Dr. Seuss book) then you only need to compute it once. So store the hash along with the actual data, and that will save you from needing to calculate the hash again in the future. However, I'm assuming that you're not already doing this; you didn't exactly explain the current process.
// Store these instead of the data directly. This assumes immutable data.
struct hashable_item
{
    data    dat;    // the actual payload (placeholder type)
    int32_t hash;   // cached hash value, computed once at insertion
};
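To make the "re-mod" point concrete, here is a sketch of a resize for a simple chained table (the std::list-of-buckets layout is an assumption, not the asker's code). Note that no hash is recomputed, only the modulus:

#include <cstddef>
#include <cstdint>
#include <list>
#include <vector>

struct data { /* payload fields */ };

struct hashable_item {
    data    dat;
    int32_t hash;   // cached hash, computed once at insertion
};

using bucket = std::list<hashable_item>;

void resize(std::vector<bucket>& table, std::size_t new_size) {
    std::vector<bucket> bigger(new_size);
    for (bucket& b : table)
        for (hashable_item& item : b) {
            // Re-mod the stored hash to pick the new bucket; no rehashing.
            std::size_t idx = static_cast<std::uint32_t>(item.hash) % new_size;
            bigger[idx].push_back(std::move(item));
        }
    table.swap(bigger);
}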

Related

When to use hash tables?

In what cases can using a hash table improve performance, and when does it not? And in what cases is using a hash table not applicable?
In what cases can using a hash table improve performance, and when does it not?
If you have reason to care, implement using hash tables and whatever else you're considering, put your actual data through, and measure which performs better.
That said, if the hash table has the operations you need (i.e. you're not expecting to iterate it in sorted order, or compare it quickly to another hash table), and has millions or more (billions, trillions...) of elements, then it'll probably be your best choice. But a lot depends on the hash table implementation (especially the choice of closed vs. open hashing), object size, hash function quality and calculation cost/runtime, comparison cost, oddities of your computer's memory performance at different cache levels... in short: too many things to make even an educated guess a better choice than measuring, when it matters.
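In that spirit, a minimal timing sketch (the containers, key count and key distribution here are placeholders; swap in your real data and the real operation you care about):

#include <chrono>
#include <cstdio>
#include <map>
#include <random>
#include <unordered_map>
#include <vector>

template <typename Map>
long long time_lookups(const Map& m, const std::vector<int>& keys) {
    auto start = std::chrono::steady_clock::now();
    long long hits = 0;
    for (int k : keys) hits += static_cast<long long>(m.count(k)); // operation under test
    auto stop = std::chrono::steady_clock::now();
    std::printf("hits=%lld\n", hits);   // use the result so the loop isn't optimised away
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}

int main() {
    std::mt19937 rng(42);
    std::vector<int> keys(1000000);
    for (int& k : keys) k = static_cast<int>(rng());

    std::unordered_map<int, int> hashed;
    std::map<int, int> ordered;
    for (int k : keys) { hashed[k] = k; ordered[k] = k; }

    std::printf("unordered_map lookups: %lld us\n", time_lookups(hashed, keys));
    std::printf("map lookups:           %lld us\n", time_lookups(ordered, keys));
}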
And in what cases is using a hash table not applicable?
Mainly when:
The input can't be hashed (e.g. you're given binary blobs and don't know which bits in there are significant, but you do have an int cmp(const T&, const T&) function you could use for a std::map), or
the available/possible hash functions are very collision prone, or
you want to avoid worst-case performance hits for:
handling lots of hash-colliding elements (perhaps "engineered" by someone trying to crash or slow down your software)
resizing the hash table: unless presized to be large enough (which can be wasteful and slow when excessive memory's used), the majority of implementations will outgrow the arrays they're using for the hash table every now and then, then allocate a bigger array and copy content across: this can make the specific insertions that cause the rehashing much slower than the normal O(1) behaviour, even though the average is still O(1); if you need more consistent behaviour in all cases, something like a balanced binary tree may serve
your access patterns are quite specialised (e.g. frequently operating on elements with keys that are "nearby" in some specific sort order), such that cache efficiency is better for other storage models that keep them nearby in memory (e.g. bucket sorted elements), even if you're not exactly relying on the sort order for e.g. iteration
We use hash tables to get O(1) access time. Imagine a dictionary. When you are looking for a word, e.g. "happy", you jump straight to 'H'. Here the hash function is determined by the first letter. And then you look for "happy" within the H bucket (actually the H bucket, then the HA bucket, then the HAP bucket, and so on).
It doesn't make sense to use hash tables when your data is ordered or needs ordering, like sorted numbers. (The alphabet is ordered A, B, C, ..., Z, but it wouldn't matter if you switched A and Z, provided you know it is switched in your dictionary.)
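A small illustration of that ordering point (not from the original answer): std::map keeps keys sorted for iteration, std::unordered_map does not.

#include <cstdio>
#include <map>
#include <unordered_map>

int main() {
    std::map<int, const char*> ordered = {{3, "c"}, {1, "a"}, {2, "b"}};
    std::unordered_map<int, const char*> hashed = {{3, "c"}, {1, "a"}, {2, "b"}};

    for (const auto& kv : ordered) std::printf("%d ", kv.first);  // always prints 1 2 3
    std::printf("\n");
    for (const auto& kv : hashed)  std::printf("%d ", kv.first);  // unspecified order
    std::printf("\n");
}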

In-memory tree/index structure for fast lookups and insertions of increasing integer keys

Background: I'm going to be inserting about a billion key-value pairs. I need an in-memory index with which I can simultaneously do lookups of the (32-bit integer) value for a (unique, 64-bit integer) key. There's no updating, no deleting and no traversing. The keys are generally gradually increasing with time.
What index structure is most appropriate to handle this?
The requirements I can think of are:
It needs to have efficient rebalancing, due to the increasing keys
It needs to use memory efficiently to fit in RAM, preferably < 28 GB
It needs to have very efficient lookups
There's probably no more efficient data structure for this problem than a simple sorted vector. (Actually, given alignment issues and depending on access characteristics, you might want to put keys and values in separate vectors.) But there are a number of practical problems, particularly if you don't know how big the data will be. If you do know this, or if you're prepared to just preallocate too much space and then die if you get more data than will fit in this space, then that's fine, although you still need to worry about keeping the vector sorted.
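As a sketch of that layout (the names are mine, and it assumes keys arrive in nondecreasing order; out-of-order arrivals need the buffering/repartitioning described below):

#include <algorithm>
#include <cstdint>
#include <optional>
#include <vector>

struct sorted_index {
    std::vector<std::uint64_t> keys;    // kept sorted
    std::vector<std::uint32_t> values;  // values[i] belongs to keys[i]

    void append(std::uint64_t key, std::uint32_t value) {
        // Caller guarantees key >= keys.back(); otherwise buffer and repartition.
        keys.push_back(key);
        values.push_back(value);
    }

    std::optional<std::uint32_t> find(std::uint64_t key) const {
        auto it = std::lower_bound(keys.begin(), keys.end(), key);
        if (it == keys.end() || *it != key) return std::nullopt;
        return values[it - keys.begin()];
    }
};

At 12 bytes per pair, a billion entries is roughly 12 GB of payload, comfortably inside the 28 GB budget.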
A possibly better approach is to keep a binary search tree of index ranges, where the leaves of the BST point to "clumps" of data (i.e. vectors). (This is essentially a B+ tree.) The clumps can be reasonably large; I'd say something like the amount of data you expect to receive in a couple of minutes, or several thousand entries. They don't have to all be the same size. (B+-trees usually have a smaller fanout than that, but since your data is "mostly sorted", you should be able to use a larger one. Don't make it too large; the only point is to reduce overhead and possibly cache-thrashing.)
Since your data is "mostly sorted", you can accumulate data for a while, keeping it in an ordinary ordered map (assuming you have such a thing), or even in a vector using insertion sort. When this buffer gets large enough, you can append it to your main data structure as a single clump, repartitioning the last clump to deal with overlaps.
If you're reasonably certain that you will only rarely get out-of-order keys, another option would be to keep a second conventional BST of out-of-order data elements. Any element which cannot be accommodated by repartitioning the new clump and the previous last one can just be added to this BST. To do a lookup, you do a parallel lookup in both the main structure and the out-of-order structure.
If you're paranoid or not certain enough about the amount of unordered data, just use the standard B+-tree insertion algorithm, which consists of creating clumps with a little bit of reserved but unused space to allow for insertions (a few per cent; you want to avoid space overhead), and splitting a clump if necessary.
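A rough sketch of that clumped structure (a std::map stands in for the BST of ranges; the repartitioning of overlapping clumps and the reserved slack are left out):

#include <algorithm>
#include <cstdint>
#include <map>
#include <optional>
#include <vector>

struct entry { std::uint64_t key; std::uint32_t value; };

class clumped_index {
    // Maps the first key of each clump to the clump itself (a sorted vector).
    std::map<std::uint64_t, std::vector<entry>> clumps_;

public:
    // Append a buffer of already-sorted entries as one new clump.
    void append_clump(std::vector<entry> sorted_buffer) {
        if (sorted_buffer.empty()) return;
        clumps_.emplace(sorted_buffer.front().key, std::move(sorted_buffer));
    }

    std::optional<std::uint32_t> find(std::uint64_t key) const {
        auto it = clumps_.upper_bound(key);     // first clump starting after key
        if (it == clumps_.begin()) return std::nullopt;
        --it;                                   // the clump that could contain key
        const std::vector<entry>& c = it->second;
        auto pos = std::lower_bound(c.begin(), c.end(), key,
            [](const entry& e, std::uint64_t k) { return e.key < k; });
        if (pos == c.end() || pos->key != key) return std::nullopt;
        return pos->value;
    }
};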

Data structure for tiled map

I want to make an infinite tiled map, from (-max_int, -max_int) to (max_int, max_int), so I'm going to make a basic structure: a chunk. Each chunk contains char tiles[w][h] and also its int x, y coordinates. For example, with h = w = 10, tile (15,5) is in chunk (1,0) at (5,5), and tile (-25,-17) is in chunk (-3,-2) at (5,3), and so on. Now there can be any number of chunks, and I need to store them and access them easily in O(log n) or better (O(1) if possible... but it's not...). It should be easy to: add, ??remove?? (not a must) and find. So what data structure should I use?
Read up on KD-trees or quad-trees (the 2D variant of the octree). Both of these might be a big help here.
So all your space is split into chunks (rectangular clusters), and the general problem is storing data in a sparse matrix (since the clusters are already implemented). Why not use two-level dictionary-like containers? I.e. an rb-tree keyed by row index whose values are rb-trees keyed by column index. Or, if you are lucky, you can use hashes to get your O(1). In both cases, if you can't find a row, you allocate it in the outer container and create a new inner container as its value, initially with only a single chunk. Of course, allocating a new chunk on an existing row will be a bit faster than on a new one, and I guess that's the only issue with this approach.
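A sketch of the hash-based variant (the packed 64-bit key and the floor division are my assumptions, not from the answers); a std::map keyed the same way would give the O(log n) option:

#include <cstdint>
#include <unordered_map>

constexpr int W = 10, H = 10;

struct chunk {
    int  x = 0, y = 0;      // chunk coordinates
    char tiles[W][H] {};    // tile payload
};

// Floor division, so negative tile coordinates land in the right chunk.
inline int floor_div(int a, int b) { return (a >= 0) ? a / b : -((-a + b - 1) / b); }

// Pack the two 32-bit chunk coordinates into a single 64-bit map key.
inline std::uint64_t pack(int cx, int cy) {
    return (static_cast<std::uint64_t>(static_cast<std::uint32_t>(cx)) << 32)
         | static_cast<std::uint32_t>(cy);
}

std::unordered_map<std::uint64_t, chunk> world;

char& tile_at(int tx, int ty) {
    int cx = floor_div(tx, W), cy = floor_div(ty, H);
    chunk& c = world[pack(cx, cy)];             // creates the chunk on first touch
    c.x = cx; c.y = cy;
    return c.tiles[tx - cx * W][ty - cy * H];
}

With this, tile_at(15, 5) lands in chunk (1,0) at offset (5,5) and tile_at(-25, -17) in chunk (-3,-2) at offset (5,3), matching the examples in the question.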

Choosing a Data structure for very large data

I have x (millions of) positive integers, whose values can be as big as allowed (+2,147,483,647). Assuming they are unique, what is the best way to store them for a lookup-intensive program?
So far I have thought of using a binary AVL tree or a hash table, where the integer is the key to the mapped data (a name). However, I am not sure whether I can implement such large keys, and in such large quantity, with a hash table (wouldn't that create a >0.8 load factor, in addition to being prone to collisions?).
Could I get some advice on which data structure might be suitable for my situation?
The choice of structure depends heavily on how much memory you have available. I'm assuming based on the description that you need lookup but not to loop over them, find nearest, or other similar operations.
Best is probably a bucketed hash table. By placing hash collisions into buckets and keeping separate arrays in the bucket for keys and values, you can both reduce the size of the table proper and take advantage of CPU cache speedup when searching a bucket. Linear search within a bucket may even end up faster than binary search!
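A sketch of that bucket layout (the fixed bucket count and the string payload are assumptions): keys and values live in separate arrays per bucket, so a lookup scans a contiguous run of keys.

#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

class bucketed_table {
    struct bucket {
        std::vector<std::uint32_t> keys;    // scanned linearly, cache-friendly
        std::vector<std::string>   values;  // values[i] belongs to keys[i]
    };
    std::vector<bucket> buckets_;

public:
    explicit bucketed_table(std::size_t n_buckets) : buckets_(n_buckets) {}

    void insert(std::uint32_t key, std::string value) {
        bucket& b = buckets_[key % buckets_.size()];
        b.keys.push_back(key);
        b.values.push_back(std::move(value));
    }

    std::optional<std::string> find(std::uint32_t key) const {
        const bucket& b = buckets_[key % buckets_.size()];
        for (std::size_t i = 0; i < b.keys.size(); ++i)   // linear scan of keys only
            if (b.keys[i] == key) return b.values[i];
        return std::nullopt;
    }
};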
AVL trees are nice for data sets that are read-intensive but not read-only AND require ordered enumeration, find-nearest and similar operations, but they're an annoying amount of work to implement correctly. You may get better performance with a B-tree because of CPU cache behaviour, though, especially a cache-oblivious B-tree algorithm.
Have you looked into B-trees? The efficiency runs between log_m(n) and log_(m/2)(n) so if you choose m to be around 8-10 or so you should be able to keep your search depth to below 10.
Bit vector, with the index set if the number is present. You can tweak it to record the number of occurrences of each number. There is a nice column about bit vectors in Bentley's Programming Pearls.
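A minimal sketch of that bit vector for the stated range: one bit per possible value, so membership tests are O(1) and the total cost is a flat 256 MB for values up to 2^31 - 1, no matter how many are actually stored.

#include <cstdint>
#include <vector>

class bit_set {
    std::vector<std::uint64_t> words_;

public:
    explicit bit_set(std::uint64_t max_value) : words_((max_value >> 6) + 1, 0) {}

    void insert(std::uint32_t v)         { words_[v >> 6] |= (std::uint64_t{1} << (v & 63)); }
    bool contains(std::uint32_t v) const { return (words_[v >> 6] >> (v & 63)) & 1; }
};

// Usage: bit_set present(2147483647); present.insert(42); /* present.contains(42) == true */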
If memory isn't an issue, a hash map is probably your best bet. Hash map lookups are O(1) on average, meaning that as you scale up the number of items to be looked up, the time it takes to find a value stays roughly the same.
A hash map where the key is the int, and the value is the name.
Do try hash tables first. There are some variants that can tolerate being very dense without significant slowdown (like Brent's variation).
If you only need to store the 32-bit integers and not any associated record, use a set and not a map, like hash_set in most C++ libraries. It would use only 4-byte records plus some constant overhead and a little slack to avoid being 100% full. In the worst case, to handle 'millions' of numbers you'd need a few tens of megabytes. Big, but nothing unmanageable.
If you need it to be much tighter, just store them sorted in a plain array and use binary search to fetch them. It will be O(log n) instead of O(1), but for 'millions' of records it's still just twenty-something steps to get any one of them. In C you have bsearch(), which is as fast as it can get.
edit: I just saw that in your question you talk about some 'mapped data (a name)'. Are those names unique? Do they also have to be in memory? If yes, they would definitely dominate the memory requirements. Even so, if the names are typical English words, most would be 10 bytes or less, keeping the total size in the 'tens of megabytes'; maybe up to a hundred megs, still very manageable.

Are there any hash functions that allow you to resize the table without also rehashing (removing + reinserting) the contents?

Is it possible, using a certain hash function and method (the division method, or double hashing), to make a chained hash table that can be resized without having to reinsert (rehash) each element already in the table?
You would still need to reinsert, but one way to make that cheaper would be to store the hash value before the modulus was applied. That way, you can save a large part of the calculation cost of rehashing.
With this approach, it would be possible to shrink the table in size as well.
I can only assume the reason you want to avoid rehashing everything is that the resulting high-latency operation is not an issue for throughput but is instead a problem for responsiveness (either in the human sense or in the SLA sense).
In theory you could use a modified closed addressing hash table like so:
remember all previous sizes where elements were added
on resize, keep the old buckets around, linked to internally via a map of sizeWhenUsed -> buckets (obviously if the buckets are empty there is no need to bother)
invariant: a mapping for a key k exists in only one of the 'internal hash tables' at any time
on addition of a value, you must first look it up in all the other maps to determine if the entry already exists and is mapped; if it is, remove it from the old one and add it to the new one
if an internal map becomes empty/falls below a certain size, it should be deleted and its remaining elements moved into the current hash table
So long as the number of internal hashes is kept constant, this will not impact the big-O behaviour of the data structure in time, though it will in memory.
This will, however, affect the actual performance, as X additional checks must be made, where X is the number of old hashes maintained.
If the wasted space of the list of buckets (the buckets themselves will be null if empty, so they are zero cost unless populated) becomes significant (use a fudge factor for this), then at some point on a rehash you may have to take the hit of moving things into the current table, unless you are willing to expend essentially unlimited memory.
Shrinking the hash will only function in the desired manner (releasing memory) if you are willing to rehash. This is unavoidable.
It is possible you could make use of some complex additional data within an open-addressing scheme to 'flag' which of the internal hashes each cell is in use by, but removals would be extremely complex to get right and would be very expensive unless you just left them as wasted space. I would never attempt this.
I would not suggest using the former method either, unless the underlying data spends very little time in the hash, so that the related churn would tend to steadily 'erase' the older-sized hashes. It is likely that a hash tuned for just this sort of behaviour and preset with an appropriate size would perform much better, though.
Since the above scheme simply trades wasted memory and throughput for a reduction in the expensive operations, with a speculative (at best) chance of reducing this waste, I would suggest that simply pre-sizing your hash to be larger than required, so that it is never resized, would be the more sensible option.
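If you did want to experiment with that layered scheme anyway, a rough sketch might look like the following (std::unordered_map is just a stand-in for each fixed-size internal table; the resize trigger and the compaction policy are left out):

#include <cstddef>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

class layered_table {
    // Oldest layers first; the last layer is the "current" table.
    std::vector<std::unordered_map<int, std::string>> layers_;

public:
    layered_table() { layers_.emplace_back(); }

    void grow() { layers_.emplace_back(); }     // start a new "current" layer

    void insert(int key, std::string value) {
        // Preserve the invariant: each key lives in exactly one layer.
        for (std::size_t i = 0; i + 1 < layers_.size(); ++i)
            layers_[i].erase(key);
        layers_.back()[key] = std::move(value);
    }

    std::optional<std::string> find(int key) const {
        for (auto it = layers_.rbegin(); it != layers_.rend(); ++it)
            if (auto hit = it->find(key); hit != it->end()) return hit->second;
        return std::nullopt;
    }

    // Fold the oldest layer into the current one once it has shrunk enough.
    void compact_oldest() {
        if (layers_.size() < 2) return;
        for (auto& kv : layers_.front())
            layers_.back().insert(std::move(kv));
        layers_.erase(layers_.begin());
    }
};

Every find and insert pays for the extra layers, which is the "X additional checks" cost mentioned above.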
Probably not - the hash would have to not use any variety of modulus, which would mean that it would have a required table size depending on the data anyway.
All hash tables must deal with collisions, whether through chaining or probing or whatever, so I suspect that if, on table resize, you simply resized the table (i.e. you didn't re-insert everything), you would have a functional, though highly non-optimal, hash table.
I assume you're asking this question because you want to avoid the high cost of resizing a hash table. You want a hash table which has guaranteed constant time (assuming no collision problems, of course). This can be done.
The trick is to iteratively initialize the next-size hash table while the current one is filling up. By the time you need it, it's ready.
Quick pseudo-code to add an element:
if resizing then
    smallTable = bigTable
    bigTable = new T[smallTable.length * 2] // if allocation zeroes memory, we lose O(1)
    set state to zeroing
elseif zeroing then
    zero a small amount of the bigTable memory
    if done zeroing then set state to transferring
elseif transferring then
    transfer a few values from the small table to the big table
    if done transferring then set state to resizing
end if
add new item to the small table
add new item to the big table
