Deleting from a HashTable - algorithm

I'm using the tombstone method to delete elements from a hash table.
That is, instead of deallocating the node and reorganizing the hash table, I simply put a DELETED mark on the deleted index, making it available for further INSERT operations while keeping it from breaking the SEARCH operation.
However, once the number of those markers exceeds a certain threshold, I actually want to deallocate those nodes and reorganize my table.
I've thought of allocating a new table whose size is (old table size - number of DELETED marks) and inserting every node that is NOT EMPTY and not marked DELETED into this new table
using the regular INSERT, but this seemed like overkill to me. Is there a better method to do what I want?
My table uses open addressing with collision-resolution schemes such as linear probing and double hashing.

The algorithm you describe is essentially rehashing, and that's an entirely reasonable approach to the problem. It has the virtue of being exactly the same code as you would use when your hash table occupancy becomes too large.
Estimating an appropriate size for the new table is tricky. It's usually a good idea to not shrink hash tables aggressively after deletions, in case the table is about to start growing again.
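
For concreteness, here is a minimal sketch of that cleanup rehash for an open-addressing table with linear probing. The slot layout, the int key/value types, and the "2 * live + 1" sizing policy (which deliberately leaves head-room rather than shrinking to exactly old size minus tombstones) are my own assumptions, not part of the question.

#include <cstddef>
#include <functional>
#include <vector>

enum class SlotState { EMPTY, OCCUPIED, DELETED };

struct Slot {
    int key = 0;
    int value = 0;
    SlotState state = SlotState::EMPTY;
};

struct Table {
    std::vector<Slot> slots;
    std::size_t num_deleted = 0;

    std::size_t probe(int key, std::size_t i) const {           // linear probing
        return (std::hash<int>{}(key) + i) % slots.size();
    }

    void insert(int key, int value) {                            // assumes the table never fills completely
        for (std::size_t i = 0; i < slots.size(); ++i) {
            Slot& s = slots[probe(key, i)];
            if (s.state != SlotState::OCCUPIED) {
                s = {key, value, SlotState::OCCUPIED};
                return;
            }
        }
    }

    // Rebuild the table without the DELETED markers: allocate a fresh slot
    // array and re-INSERT only the occupied entries.
    void purge_tombstones() {
        std::size_t live = 0;
        for (const Slot& s : slots)
            if (s.state == SlotState::OCCUPIED) ++live;

        Table fresh;
        fresh.slots.resize(2 * live + 1);     // leave head-room so the new table is not nearly full
        for (const Slot& s : slots)
            if (s.state == SlotState::OCCUPIED)
                fresh.insert(s.key, s.value);

        *this = std::move(fresh);             // old slots (and tombstones) are deallocated here
    }
};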

Related

Hash table rehashing and iterator invalidation

What techniques are known to prevent iterator invalidation after/during rehashing? In particular, I'm interested in collision-chaining hash tables with incremental rehashing.
Suppose we're iterating over a hash table via an iterator, we insert an element during the iteration, and that insertion causes a full or partial table rehash. I'm looking for hash table variants which allow iteration to continue and guarantee that all elements are visited (except perhaps the newly inserted one, which doesn't matter) and that no element is visited twice.
AFAIK C++ unordered_map invalidates iterators during rehash. Also, AFAIK Go's map has incremental rehashing and doesn't invalidate iterators (range loop state), so it's likely what I'm looking for, but I can't fully understand the source code so far.
One possible solution is to have a doubly-linked list of all elements, parallel to the hash table, which is not affected by rehashing. This solution requires two extra pointers per element. I feel that better solutions should exist.
AFAIK C++ unordered_map invalidates iterators during rehash.
Correct. cppreference.com summarises unordered_map iterator invalidation thus:
Operations                                   Invalidated
==========                                   ===========
All read only operations, swap, std::swap    Never
clear, rehash, reserve, operator=            Always
insert, emplace, emplace_hint, operator[]    Only if causes rehash
erase                                        Only to the element erased
If you want to use unordered_map, your options are:
call reserve() before you start your iteration/insertions, to avoid rehashing (see the sketch after this list)
change max_load_factor() before you start your iteration/insertions, to avoid rehashing
store the elements to be inserted in say a vector during the iteration, then move them into the unordered_map afterwards
create e.g. vector<T*> or vector<reference_wrapper<T>> to the elements, iterate over it instead of the unordered_map, but still do your insertions into the unordered_map
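
As a small illustration of the first two options, here is a sketch that reserves room for an assumed upper bound of 100 insertions before the loop, so the insert inside the loop cannot trigger a rehash:

#include <iostream>
#include <unordered_map>

int main() {
    std::unordered_map<int, int> m{{1, 10}, {2, 20}, {3, 30}};

    // Option 1: make room up front so insertions during the loop cannot trigger
    // a rehash, which is the only thing that would invalidate the iterators.
    m.reserve(m.size() + 100);            // 100 is an arbitrary upper bound on the inserts below

    // Option 2 would instead raise the load-factor ceiling:
    // m.max_load_factor(4.0f);

    for (const auto& [k, v] : m) {
        if (v == 20)
            m.emplace(k + 100, v + 1);    // no rehash possible, so iteration continues safely
    }
    std::cout << m.size() << '\n';        // prints 4
}

Note that the newly inserted element may or may not be visited by the remainder of the loop, which matches the relaxation allowed in the question.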
If you really want incremental rehashing, you could write a class that wraps two unordered_maps, and when you see an insertion that would cause a rehash of the first map, you start inserting into the second (for which you'd reserve twice the size of the first map). You could manually control when all the elements from the first map were shifted to the second, or have it happen as a side effect of some other operation (e.g. migrate one element each time a new element is inserted). This wrapping approach will be much easier than writing an incremental rehashing table from scratch.
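
A rough sketch of that wrapper idea follows; the class name IncrementalMap, the migration rate, and the doubling policy are all invented for illustration, and a real version would need the full container interface:

#include <cstddef>
#include <string>
#include <unordered_map>

// Wraps two unordered_maps and migrates a few elements per insertion,
// so no single insert ever pays for a full rehash of the whole data set.
template <class K, class V>
class IncrementalMap {
    std::unordered_map<K, V> old_, cur_;

    void migrate_some(std::size_t n = 2) {            // amortise the move, n entries per insert
        while (n > 0 && !old_.empty()) {
            auto it = old_.begin();
            cur_.insert(std::move(*it));
            old_.erase(it);
            --n;
        }
    }

public:
    explicit IncrementalMap(std::size_t initial_capacity = 16) {
        cur_.reserve(initial_capacity);
    }

    void insert(const K& key, const V& value) {
        // If the next insert would push cur_ past its load factor, start a new,
        // twice-as-large map and drain the current one into it gradually.
        if (old_.empty() &&
            cur_.size() + 1 > cur_.bucket_count() * cur_.max_load_factor()) {
            std::unordered_map<K, V> bigger;
            bigger.reserve(2 * (cur_.size() + 1));
            old_ = std::move(cur_);
            cur_ = std::move(bigger);
        }
        old_.erase(key);                   // keep each key in exactly one of the two maps
        cur_[key] = value;
        migrate_some();
    }

    const V* find(const K& key) const {
        if (auto it = cur_.find(key); it != cur_.end()) return &it->second;
        if (auto it = old_.find(key); it != old_.end()) return &it->second;
        return nullptr;
    }

    std::size_t size() const { return cur_.size() + old_.size(); }
};

// Usage: IncrementalMap<std::string, int> m;  m.insert("a", 1);  if (auto* p = m.find("a")) { /* ... */ }

Lookups have to consult both maps until the migration finishes, which is the price paid for never pausing on a full rehash.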

Inserting an element into a full hash table with a constant number of buckets

I am studying hash table at the moment, and got a question about its implementation with a fixed size of buckets.
Suppose we have a hash table with 23 buckets (for example). Let's use the simplest hash function (hash_value = key % table_size), with the keys being integers only. If we say that one bucket can hold at most 1 element (no separate chaining), does it mean that when all buckets are full we will no longer be able to insert any element into the table at all? Or will we have to actually replace an element that has the same hash value with the new element?
I do understand that I am putting a lot of constraints on this, and a real implementation might never look like that, but I want to be sure I understand this particular case.
A real implementation usually allows the hash table to resize, but this takes a long time and is undesirable. For a fixed-size hash table, the insert would probably return an error code or throw an exception, and it is up to the user to handle that error or not.
Or will we have to actually replace element that has the same hash value with a new element?
In Java's HashMap, if you add a key that is equal to one already present in the hash table, only the value associated with that key is replaced by the new one; a value is never replaced merely because two different keys hash to the same value.
Yes. An "open" hash table - which you are describing - has a fixed size, so it can fill up.
However implementations will usually respond by copying all contents into a new, bigger table. In fact, normally they won't wait to fill entirely, but use some criterion - for example a fraction of all space used (sometimes called the "load factor") - to decide when it's time to expand.
Some implementations will also "shrink" themselves to a smaller table if the load factor becomes too small due to deletions.
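
As a sketch of the fixed-size case from the question, here is a linear-probing insert that simply reports failure once all 23 buckets are occupied; the struct names and the bool error convention are assumptions:

#include <array>
#include <cstddef>
#include <iostream>

constexpr std::size_t TABLE_SIZE = 23;                   // 23 buckets, as in the question

struct Bucket {
    int key = 0;
    bool used = false;
};

struct FixedTable {
    std::array<Bucket, TABLE_SIZE> buckets{};

    // Returns false instead of resizing when every bucket is occupied.
    bool insert(int key) {
        std::size_t h = static_cast<std::size_t>(key % static_cast<int>(TABLE_SIZE));
        for (std::size_t i = 0; i < TABLE_SIZE; ++i) {
            Bucket& b = buckets[(h + i) % TABLE_SIZE];   // linear probing
            if (!b.used)      { b = {key, true}; return true; }
            if (b.key == key) { return true; }           // key already present, nothing to do
        }
        return false;                                    // table is full: report the error
    }
};

int main() {
    FixedTable t;
    for (int k = 0; k < 30; ++k)
        if (!t.insert(k))
            std::cout << "table full, could not insert " << k << '\n';
}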
You'd probably find reading Google's hash table implementation, which includes some documentation of its internals, to be a good learning experience.

Resizing tradeoffs when implementing a hash table using linear probing

I am trying to implement a hashtable using linear probing.
Before inserting a (key, value) pair into the hashtable, I want to check if it's half full. If it is, I need to double the size of the underlying array.
Obviously, there are two ways to do that:
One is to create another array with the doubled size, rehash all entries in the old one and add them to the new array. Then, rebind the old array to the new one. This way is easy to implement but uses a lot of space.
The other one is to double the array and do the rehashing in-place. It seems that this way may lead to longer running time because rehashing may cause collisions with both newly hashed slots and old slots.
Which way should I use?
Your second solution only saves space during the resize process if there is in fact room to expand the existing hash table in-place - I think the chances of that being the case for a large hash table are quite slim, so I would just go for your first solution.

Are there any hash functions that allow you to resize the table without also rehashing (removing + reinserting) the contents?

Is it possible using a certain hash function and method (the division method, or double hashing) to make a chained hash table that can be resized without having to reinsert (rehash) each element already in the table?
You would still need to reinsert, but some way to make that cheaper would be to store the hash value before the modulus was applied. That way, you can save a large part of the calculation cost of rehashing.
With this approach, it would be possible to shrink the table in size as well.
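
A sketch of that idea for a chained table: each node stores the full, pre-modulus hash, so resizing only redistributes nodes with a cheap modulo rather than recomputing the hash function. The Node/ChainedTable names are illustrative only.

#include <cstddef>
#include <functional>
#include <list>
#include <string>
#include <vector>

struct Node {
    std::string key;
    int value;
    std::size_t full_hash;                       // hash value before the modulus was applied
};

struct ChainedTable {
    std::vector<std::list<Node>> buckets;

    ChainedTable() : buckets(8) {}

    void insert(const std::string& key, int value) {
        std::size_t h = std::hash<std::string>{}(key);   // computed once, stored forever
        buckets[h % buckets.size()].push_back({key, value, h});
    }

    // Resizing never calls std::hash again: it reuses each node's stored full_hash.
    void resize(std::size_t new_bucket_count) {
        std::vector<std::list<Node>> fresh(new_bucket_count);
        for (auto& chain : buckets)
            for (auto& node : chain)
                fresh[node.full_hash % new_bucket_count].push_back(std::move(node));
        buckets = std::move(fresh);
    }
};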
I can only assume the reason you want to avoid rehashing everything is that the resulting high-latency operation is not an issue for throughput but is instead a problem for responsiveness (either in the human sense or the SLA sense).
In theory you could use a modified closed addressing hash table like so:
Remember all previous sizes at which elements were added.
On resize, keep the old buckets around, linked to internally via a map of sizeWhenUsed -> buckets (obviously, if the buckets are empty there is no need to bother).
Maintain the invariant that a mapping for a key k exists in only one of the 'internal hash tables' at any time.
On addition of a value, you must first look it up in all the other maps to determine whether the entry already exists and is mapped. If it is, remove it from the old one and add it to the new one.
If an internal map becomes empty or falls below a certain size, it should be deleted and its remaining elements moved into the current hash table.
So long as the number of internal hashes is kept constant, this will not impact the big-O behaviour of the data structure in time, though it will in memory.
It will, however, affect the actual performance, as X additional checks must be made, where X is the number of old hashes maintained.
If the wasted space of the lists of buckets (the buckets themselves will be null if empty, so they are zero cost unless populated) becomes significant (use a fudge factor for this), then at some point during a rehash you may have to take the hit of moving things into the current table, unless you are willing to expend essentially unlimited memory.
Downgrades in size of the hash will only function in the desired manner (releasing memory) if you are willing to rehash. This is unavoidable.
It is possible you could make use of some complex additional data within an open addressing scheme to 'flag' which of the internal hashes a cell was in use by, but removals would be extremely complex to get right and would be very expensive unless you just left them as wasted space. I would never attempt this.
I would not suggest using the former method either, unless the underlying data spends very little time in the hash, so that the related churn would tend to steadily 'erase' the older-sized hashes. It is likely that a hash tuned for just this sort of behaviour and preset with an appropriate size would perform much better, though.
Since the above scheme simply trades wasted memory and throughput for a reduction in the expensive operations, with a speculative (at best) chance of reducing this waste, I would suggest that simply pre-sizing your hash to be larger than required, so that it is never resized, would be a more sensible option.
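
For what it's worth, here is a compressed sketch of the multi-table scheme above, using std::unordered_map for each internal table purely to keep the example short; every name in it is invented, and a real implementation would control the internal layout itself.

#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Keeps the tables from every previous size generation alive instead of
// rehashing them; lookups consult all generations, insertions preserve the
// invariant that a key lives in exactly one generation.
class GenerationalMap {
    std::vector<std::unordered_map<std::string, int>> old_;   // previous generations
    std::unordered_map<std::string, int> cur_;                // current generation
    std::size_t capacity_ = 8;

public:
    GenerationalMap() { cur_.reserve(capacity_); }

    void insert(const std::string& key, int value) {
        for (auto& g : old_) g.erase(key);           // enforce the single-copy invariant
        if (cur_.size() + 1 > capacity_ && cur_.count(key) == 0) {
            old_.push_back(std::move(cur_));         // retire the full generation, no rehash
            capacity_ *= 2;
            cur_ = {};
            cur_.reserve(capacity_);
        }
        cur_[key] = value;
    }

    const int* find(const std::string& key) const {
        if (auto it = cur_.find(key); it != cur_.end()) return &it->second;
        for (const auto& g : old_)
            if (auto it = g.find(key); it != g.end()) return &it->second;
        return nullptr;
    }

    // When an old generation drains below a threshold, fold it into the
    // current table and drop it (the "move remaining elements" step above).
    void compact(std::size_t threshold = 2) {
        for (std::size_t i = 0; i < old_.size(); ) {
            if (old_[i].size() <= threshold) {
                for (auto& kv : old_[i]) cur_[kv.first] = kv.second;
                old_.erase(old_.begin() + static_cast<std::ptrdiff_t>(i));
            } else {
                ++i;
            }
        }
    }
};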
Probably not - the hash would have to not use any variety of modulus, which would mean that it would have a required table size depending on the data anyway.
All hash tables must deal with collisions, either through chaining or probing or whatever, so I suspect that if upon table resize you simply resized the table (i.e., you don't re-insert everything), you would have a functional, though highly non-optimal, hash table.
I assume you're asking this question because you want to avoid the high cost of resizing a hash table. You want a hash table which has guaranteed constant time (assuming no collision problems, of course). This can be done.
The trick is to iteratively initialize the next-size hash table while the current one is filling up. By the time you need it, it's ready.
Quick pseudo-code to add an element:
if resizing then
    smallTable = bigTable
    bigTable = new T[smallTable.length * 2]  //if allocation zeroes memory, we lose O(1)
    set state to zeroing
elseif zeroing then
    zero a small amount of the bigTable memory
    if done zeroing then set state to transferring
elseif transferring then
    transfer a few values from the small table to the big table
    if done transferring then set state to resizing
end if
add new item to small table
add new item to big table

Best way to remove an entry from a hash table

What is the best way to remove an entry from a hash table that uses linear probing? One way to do this would be to use a flag to indicate deleted elements. Are there any better ways?
An easy technique is to:
Find and remove the desired element
Go to the next bucket
If the bucket is empty, quit
If the bucket is full, delete the element in that bucket and re-add it to the hash table using the normal means. The item must be removed before re-adding, because it is likely that the item could be added back into its original spot.
Repeat step 2.
This technique keeps your table tidy at the expense of slightly slower deletions.
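
A sketch of that delete-and-re-add technique for a linear-probing table of integer keys; the slot representation and names are assumptions:

#include <cstddef>
#include <functional>
#include <optional>
#include <vector>

struct ReaddTable {
    std::vector<std::optional<int>> slots;           // each slot holds a key or is empty

    explicit ReaddTable(std::size_t n) : slots(n) {}

    std::size_t home(int key) const { return std::hash<int>{}(key) % slots.size(); }

    void insert(int key) {                            // plain linear probing; assumes spare room
        std::size_t i = home(key);
        while (slots[i].has_value()) i = (i + 1) % slots.size();
        slots[i] = key;
    }

    std::optional<std::size_t> find_slot(int key) const {
        std::size_t i = home(key);
        while (slots[i].has_value()) {
            if (*slots[i] == key) return i;
            i = (i + 1) % slots.size();
        }
        return std::nullopt;                          // hit an empty slot: key absent
    }

    // 1) remove the element, 2) walk the rest of the cluster, re-inserting each
    //    member so nothing is stranded behind the hole left by the deletion.
    void erase(int key) {
        auto found = find_slot(key);
        if (!found) return;
        slots[*found].reset();
        std::size_t j = (*found + 1) % slots.size();
        while (slots[j].has_value()) {                // stop at the first empty slot
            int k = *slots[j];
            slots[j].reset();                         // must remove before re-adding
            insert(k);                                // may land back in the same spot
            j = (j + 1) % slots.size();
        }
    }
};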
It depends on how you handle overflow and whether (1) the item being removed is in an overflow slot or not, and (2) if there are overflow items beyond the item being removed, whether they have the hash key of the item being removed or possibly some other hash key. [Overlooking that double condition is a common source of bugs in deletion implementations.]
If collisions overflow into a linked list, it is pretty easy. You're either popping the head off the list (which may leave the list empty) or deleting a member from the middle or end of the linked list. Those cases are fun and not particularly difficult. There can be other optimizations to avoid excessive memory allocation and freeing to make this even more efficient.
For linear probing, Knuth suggests that a simple approach is to have a way to mark a slot as empty, deleted, or occupied. Mark a removed occupant slot as deleted so that overflow by linear probing will skip past it, but if an insertion is needed, you can fill the first deleted slot that you passed over [The Art of Computer Programming, vol.3: Sorting and Searching, section 6.4 Hashing, p. 533 (ed.2)]. This assumes that deletions are rather rare.
Knuth gives a nice refinement as Algorithm R6.4 [pp. 533-534] that instead marks the cell as empty rather than deleted, and then finds ways to move table entries back closer to their initial-probe location by moving the hole that was just made until it ends up next to another hole.
Knuth cautions that this will move existing still-occupied slot entries and is not a good idea if pointers to the slots are being held onto outside of the hash table. [If you have garbage-collected- or other managed-references in the slots, it is all right to move the slot, since it is the reference that is being used outside of the table and it doesn't matter where the slot that references the same object is in the table.]
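
A sketch of that hole-moving deletion; the cyclic "does this entry's home slot fall between the hole and the entry" test is the part that is easy to get wrong. The struct and names are assumptions, and the table stores plain integer keys to keep the example short:

#include <cstddef>
#include <functional>
#include <optional>
#include <vector>

struct ProbedTable {
    std::vector<std::optional<int>> slots;                   // key or empty, no DELETED marks

    explicit ProbedTable(std::size_t n) : slots(n) {}

    std::size_t home(int key) const { return std::hash<int>{}(key) % slots.size(); }

    // Erase the entry at slot i by repeatedly pulling a later cluster member
    // back into the hole, until the hole ends up next to an empty slot.
    void erase_at(std::size_t i) {
        const std::size_t n = slots.size();
        slots[i].reset();
        std::size_t j = i;
        while (true) {
            j = (j + 1) % n;
            if (!slots[j].has_value()) return;               // hole is now next to another hole
            std::size_t k = home(*slots[j]);                 // this entry's ideal slot
            // Move slots[j] into the hole only if its home slot k does NOT lie
            // cyclically in the interval (i, j]; otherwise probing can still reach it.
            bool movable = (j > i) ? (k <= i || k > j)
                                   : (k <= i && k > j);
            if (movable) {
                slots[i] = slots[j];
                slots[j].reset();
                i = j;
            }
        }
    }
};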
The Python hash table implementation (arguably very fast) uses dummy elements to mark deletions. As you grow or shrink the table (assuming you're not doing a fixed-size table), you can drop the dummies at the same time.
If you have access to a copy, have a look at the article in Beautiful Code about the implementation.
The best general solutions I can think of include:
If you can use a non-const iterator (as in the C++ STL or Java), you should be able to remove elements as you encounter them. Presumably, though, you wouldn't be asking this question unless you're using a const iterator or an enumerator which would be invalidated if the underlying collection is modified.
As you said, you could mark a deleted flag within the contained object. This doesn't release any memory or reduce collisions on the key, though, so it's not the best solution. It also requires adding a property to the class that probably doesn't really belong there. If this bothers you as much as it would me, or if you simply can't add a flag to the stored object (perhaps you don't control the class), you could store these flags in a separate hash table. This requires the most long-term memory use.
Push the keys of the to-be-removed items into a vector or array list while traversing the hash table. After releasing the enumerator, loop through this secondary list and remove the keys from the hash table (see the sketch after this list). If you have a lot of items to remove and/or the keys are large (which they shouldn't be), this may not be the best solution.
If you're going to end up removing more items from the hash table than you're leaving in there, it may be better to create a new hash table, and as you traverse your original one, add to the new hash table only the items you're going to keep. Then replace your reference(s) to the old hash table with the new one. This saves a secondary list iteration, but it's probably only efficient if the new hash table will have significantly fewer items than the original one, and it definitely only works if you can change all the references to the original hash table, of course.
If your hash table gives you access to its collection of keys, you may be able to iterate through those and remove items from the hash table in one pass.
If your hash table or some helper in your library provides you with predicate-based collection modifiers, you may have a Remove() function to which you can pass a lambda expression or function pointer to identify the items to remove.
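
Here is a concrete version of the "collect the keys first" option above, using std::unordered_map and an arbitrary removal predicate:

#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

int main() {
    std::unordered_map<std::string, int> table{
        {"a", 1}, {"b", 2}, {"c", 3}, {"d", 4}};

    // Pass 1: record which keys to drop, without touching the table.
    std::vector<std::string> doomed;
    for (const auto& [key, value] : table)
        if (value % 2 == 0)                       // arbitrary removal predicate
            doomed.push_back(key);

    // Pass 2: now it is safe to modify the table.
    for (const auto& key : doomed)
        table.erase(key);

    std::cout << table.size() << '\n';            // prints 2
}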
A common technique when time is a factor is to have a second table of deleted items, and clean up the main table when you have time. Commonly used in search engines.
How about enhancing the hash table to contain pointers, like a linked list?
When you insert, if the bucket is full, create a pointer from this bucket to the bucket where the new entry is stored.
When deleting something from the hash table, the solution is then equivalent to deleting a node from a linked list.

Resources