Is deleting a hash faster than deleting keys individually? Or does deleting a hash just iterate over the keys in the hash?

Currently trying to build a caching system that's efficient to update. I will need to delete my hashes here and there. Would it be faster to delete a whole hash at once, or is there no performance difference?
I'm wondering this because the Redis documentation states that the DEL command has a time complexity of O(M) for hashes, and I'm unsure whether M refers to the number of fields I choose to delete within the hash or to the total number of fields in the hash.
I'm expecting deletion of an entire hash to be O(1), but I could be wrong.
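For reference, here is what the two approaches look like with the redis-py client (key and field names are made up for illustration):

```python
import redis

r = redis.Redis()  # assumes a Redis server on localhost:6379

# Build a throwaway hash (names are made up for illustration).
r.hset("cache:user:1", mapping={"name": "a", "age": "30", "city": "x"})

# Option 1: delete fields individually. HDEL is documented as O(N) in the
# number of fields actually removed.
r.hdel("cache:user:1", "name", "age")

# Option 2: delete the whole key. DEL is documented as O(M) where M is the
# number of fields in the hash, because Redis must free each field.
r.delete("cache:user:1")

# UNLINK returns in O(1) and reclaims the memory in a background thread.
r.unlink("cache:user:1")
```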

Related

How do hash tables resolve bucket ambiguity and probes?

I am reading Data Structures and Algorithms & Software Principles in C to try to wrap my head around some internals of data structures, and two things are really bothering me:
(1) How do hash tables deal with deciding which item in the bucket is the item you are looking up if they all have the same hash?
e.g.
Get Key, Value
use the hash algorithm on the key to find the index to put the value into
if the slot is taken but holds only a single entry (no bucket yet), create a bucket, move the existing item into it, and then add the new value as well.
Now I have a bucket with a bunch of values and a "lost and found" problem: you can't tell which value belongs to which key, because all the keys map to the same hash and the items in the bucket carry no key to search the bucket by.
This would work if the bucket saves keys as well as values for each entry, but I am confused since I can't find a site that confirms that hash tables save keys along with the values for their entries.
(2) How do hash tables tell if the value at an index is the correct value for the key, or if probing found a collision and put it elsewhere.
eg.
Get Key, Value
hash the key to find the index (0)
the index is taken, so use a naive probing algorithm: linear search until an empty slot is found (slot 1 is empty).
Now I search for my key and land on index 0. How does the table know that index 0 does not hold the correct item for this key, and that the item was instead probed into slot 1?
Again, this would make sense to me if the table saved the key as well as the value for each entry, but I am not sure whether hash tables save keys along with values, have some other way of ensuring that the item at the hash index or bucket index is the correct one, or whether I am misunderstanding it.
To clarify the question: do hash tables save key along with value to disambiguate buckets and probe sequences or do they use something else to avoid ambiguity of hashes?
Sorry for the crudely formulated question but I just had to ask.
Thanks ahead of time.
Hash tables save the entry. An entry consists of a key and a value.
How do hash tables deal with deciding which item in the bucket is the item you are looking up if they all have the same hash?
Because the query is done by passing the key.
The purpose of hashing is to reduce the time needed to find the index. The key is hashed to find the right bucket. Then, once the candidates have been reduced from a total of N items to a very small n, you can even perform a linear search to find the right item among all the keys that share the same hash.
How do hash tables tell if the value at an index is the correct value for the key, or if probing found a collision and put it elsewhere.
Again, that's because the hash table saves entries instead of just values. If, in case of a collision, the hash table sees that the key found at this bucket is not the key that was queried, it knows that a collision occurred earlier and that the key may be in the next bucket. Note that in this case the bucket stores a single entry, unlike the first answer, where the bucket may store a linked list or a tree of entries.
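Here is a minimal sketch of separate chaining in Python (names are illustrative) that shows why storing full entries resolves the ambiguity: lookup compares the stored keys inside the bucket, not the hashes:

```python
class ChainedHashTable:
    """Each bucket is a list of (key, value) entries, so colliding keys
    can be told apart by comparing the stored key, not just the hash."""
    def __init__(self, capacity=8):
        self.buckets = [[] for _ in range(capacity)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # same key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))      # new entry keeps its key with the value

    def get(self, key):
        for k, v in self._bucket(key):
            if k == key:                 # linear search within the small bucket
                return v
        raise KeyError(key)
```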

Resizing tradeoffs when implementing a hash table using linear probing

I am trying to implement a hashtable using linear probing.
Before inserting a (key, value) pair into the hashtable, I want to check if it's half full. If it is, I need to double the size of the underlying array.
Obviously, there are two ways to do that:
One is to create another array of double the size, rehash all entries in the old one, and add them to the new array. Then rebind the table's reference from the old array to the new one. This way is easy to implement but uses a lot of space during the resize.
The other is to grow the array in place to double its size and rehash the entries where they are. It seems this way may lead to a longer running time, because rehashing may collide with both slots that have already been rehashed and slots that haven't been moved yet.
Which way should I use?
Your second solution only saves space during the resize process if there is in fact room to expand the existing hash table in-place - I think the chances of that being the case for a large hash table are quite slim, so I would just go for your first solution.
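For illustration, the first approach might look like this; the `table` object, its `slots` array, and its `insert` method are hypothetical stand-ins for your implementation:

```python
def resize(table):
    # 'table' is a hypothetical object with a .slots list holding (key, value)
    # tuples or None, and an insert() that probes from hash(key) % len(slots).
    old_slots = table.slots
    table.slots = [None] * (2 * len(old_slots))  # the new, doubled array
    for entry in old_slots:
        if entry is not None:
            table.insert(*entry)  # re-hash: the index depends on the new length
    # the old array becomes garbage once this function returns
```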

Using an array list with a hash table

I'm attempting to build a simple hash table from scratch. The hash table I have currently uses an array of linked lists. The hashing function takes the hash value of a key-value pair object modulo the size of the array for indexing. This is all well and good, but I'm wondering if I could dynamically expand my array by using an array list once it starts to fill up (tell me why this is not a good idea if you think so). Obviously the hash function would be compromised, since we're finding indexes using the array length. What would be a good hash function that would allow my array of linked lists to expand without compromising the integrity of the hashing?
If I am understanding your question correctly, you will have to re-hash all elements after expanding the bucket array. It can be done by iterating over the contents of the old hash table, and inserting them into the newly expanded hash table.
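A sketch of that idea, assuming a chained table; the point is that the hash function itself never changes, only the modulo by the current capacity:

```python
def bucket_index(key, capacity):
    # The hash function itself is size-independent; only the final
    # modulo depends on the current capacity.
    return hash(key) % capacity

def grow(buckets):
    new_buckets = [[] for _ in range(2 * len(buckets))]
    for bucket in buckets:               # re-home every entry under the new size
        for key, value in bucket:
            new_buckets[bucket_index(key, len(new_buckets))].append((key, value))
    return new_buckets
```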

Hash table optimized for full iteration + key replacement

I have a hash table where the vast majority of accesses at run-time follow one of the following patterns:
Iterate through all key/value pairs. (The speed of this operation is critical.)
Modify keys (i.e. remove a key/value pair & add another with the same value but a different key. Detect duplicate keys & combine values if necessary.) This is done in a loop, affecting many thousands of keys, but with no other operations intervening.
I would also like it to consume as little memory as possible.
Other standard operations must be available, though they are used less frequently, e.g.
Insert a new key/value pair
Given a key, look up the corresponding value
Change the value associated with an existing key
Of course all "standard" hash table implementations, including standard libraries of most high-level-languages, have all of these capabilities. What I am looking for is an implementation that is optimized for the operations in the first list.
Issues with common implementations:
Most hash table implementations use separate chaining (i.e. a linked list for each bucket.) This works but I am hoping for something that occupies less memory with better locality of reference. Note: my keys are small (13 bytes each, padded to 16 bytes.)
Most open addressing schemes have a major disadvantage for my application: Keys are removed and replaced in large groups. That leaves deletion markers that increase the load factor, requiring the table to be re-built frequently.
Schemes that work, but are less than ideal:
Separate chaining with an array (instead of a linked list) per bucket:
Poor locality of reference, resulting from memory fragmentation as small arrays are reallocated many times
Linear probing/quadratic hashing/double hashing (with or without Brent's Variation):
Table quickly fills up with deletion markers
Cuckoo hashing
Only works for <50% load factor, and I want a high LF to save memory and speed up iteration.
Is there a specialized hashing scheme that would work well for this case?
Note: I have a good hash function that works well with both power-of-2 and prime table sizes, and can be used for double hashing, so this shouldn't be an issue.
Would extendible hashing help? Iterating through the keys by walking the 'directory' should be fast. Not sure if the "modify key for value" operation is any better with this scheme or not.
Based on how you're accessing the data, does it really make sense to use a hash table at all?
Since your main use cases involve iteration, a sorted list or a B-tree might be a better data structure.
It doesn't seem like you really need the constant-time random access a hash table is built for.
You can do much better than a 50% load factor with cuckoo hashing.
Two hash functions with four items per bucket will get you over 90% with little effort. See this paper:
http://www.ru.is/faculty/ulfar/CuckooHash.pdf
I'm building a pre-computed dictionary using a cuckoo hash and getting a load factor of better than 99% with two hash functions and seven items per bucket.
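For the curious, here is a rough Python sketch of a bucketized cuckoo table along those lines, with two hash functions and four slots per bucket; the hash mixing, the random victim choice, and the fixed kick limit are simplifying assumptions, and a real implementation would resize rather than give up:

```python
import random

class CuckooHash:
    """Bucketized cuckoo hashing sketch: 2 hash functions, 4 slots per bucket."""
    SLOTS = 4
    MAX_KICKS = 500

    def __init__(self, num_buckets=16):
        self.num_buckets = num_buckets
        self.buckets = [[] for _ in range(num_buckets)]  # (key, value) pairs

    def _indexes(self, key):
        # Two hash functions; this mixing is an ad-hoc choice for the sketch.
        h1 = hash(key) % self.num_buckets
        h2 = hash((key, 0x9E3779B9)) % self.num_buckets
        return h1, h2

    def get(self, key):
        for i in self._indexes(key):     # a key lives in one of two buckets
            for k, v in self.buckets[i]:
                if k == key:
                    return v
        raise KeyError(key)

    def put(self, key, value):
        for i in self._indexes(key):     # overwrite if the key already exists
            bucket = self.buckets[i]
            for slot, (k, _) in enumerate(bucket):
                if k == key:
                    bucket[slot] = (key, value)
                    return
        entry = (key, value)
        i = self._indexes(key)[0]
        for _ in range(self.MAX_KICKS):
            bucket = self.buckets[i]
            if len(bucket) < self.SLOTS:
                bucket.append(entry)
                return
            # Both slots full: evict a random victim to its alternate bucket.
            victim = bucket.pop(random.randrange(self.SLOTS))
            bucket.append(entry)
            entry = victim
            h1, h2 = self._indexes(entry[0])
            i = h2 if i == h1 else h1
        raise RuntimeError("too many kicks; a real table would resize here")
```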

Best way to remove an entry from a hash table

What is the best way to remove an entry from a hash table that uses linear probing? One way would be to use a flag to mark deleted elements. Are there any better ways?
An easy technique is to:
Find and remove the desired element
Go to the next bucket
If the bucket is empty, quit
If the bucket is full, delete the element in that bucket and re-add it to the hash table using the normal means. The item must be removed before re-adding, because it is likely that the item could be added back into its original spot.
Repeat step 2.
This technique keeps your table tidy at the expense of slightly slower deletions.
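A sketch of that technique, assuming the table is a plain Python list whose slots hold either None or a (key, value) tuple, and that it is never allowed to fill completely (so probe loops always reach a None and terminate):

```python
def lp_insert(table, key, value):
    n = len(table)
    i = hash(key) % n
    while table[i] is not None and table[i][0] != key:
        i = (i + 1) % n                  # linear probing
    table[i] = (key, value)

def lp_delete(table, key):
    n = len(table)
    i = hash(key) % n
    while table[i] is not None and table[i][0] != key:
        i = (i + 1) % n
    if table[i] is None:
        return                           # key not present
    table[i] = None                      # step 1: remove the desired element
    j = (i + 1) % n
    while table[j] is not None:          # steps 2-5: walk the rest of the cluster
        k, v = table[j]
        table[j] = None                  # remove before re-adding, as above
        lp_insert(table, k, v)           # may land back in the same spot
        j = (j + 1) % n
```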
It depends on how you handle overflow and whether (1) the item being removed is in an overflow slot or not, and (2) if there are overflow items beyond the item being removed, whether they have the hash key of the item being removed or possibly some other hash key. [Overlooking that double condition is a common source of bugs in deletion implementations.]
If collisions overflow into a linked list, it is pretty easy. You're either popping up the list (which may have gone empty) or deleting a member from the middle or end of the linked list. Those are fun and not particularly difficult. There can be other optimizations to avoid excessive memory allocations and freeings to make this even more efficient.
For linear probing, Knuth suggests that a simple approach is to have a way to mark a slot as empty, deleted, or occupied. Mark a removed occupant slot as deleted so that overflow by linear probing will skip past it, but if an insertion is needed, you can fill the first deleted slot that you passed over [The Art of Computer Programming, vol.3: Sorting and Searching, section 6.4 Hashing, p. 533 (ed.2)]. This assumes that deletions are rather rare.
Knuth gives a nice refinement as Algorithm R6.4 [pp. 533-534] that instead marks the cell as empty rather than deleted, and then finds ways to move table entries back closer to their initial-probe location by moving the hole that was just made until it ends up next to another hole.
Knuth cautions that this will move existing still-occupied slot entries and is not a good idea if pointers to the slots are being held onto outside of the hash table. [If you have garbage-collected- or other managed-references in the slots, it is all right to move the slot, since it is the reference that is being used outside of the table and it doesn't matter where the slot that references the same object is in the table.]
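For comparison, here is a sketch of the simple tombstone scheme described first, assuming the table always keeps at least one truly empty slot so the probe loops terminate:

```python
EMPTY = object()    # slot never used
DELETED = object()  # tombstone left behind by a deletion
# table = [EMPTY] * capacity

def tomb_find(table, key):
    n = len(table)
    i = hash(key) % n
    while table[i] is not EMPTY:         # tombstones do NOT stop the search
        if table[i] is not DELETED and table[i][0] == key:
            return i
        i = (i + 1) % n
    return None

def tomb_insert(table, key, value):
    n = len(table)
    i = hash(key) % n
    first_deleted = None
    while table[i] is not EMPTY:
        if table[i] is DELETED:
            if first_deleted is None:
                first_deleted = i        # remember the first tombstone passed
        elif table[i][0] == key:
            table[i] = (key, value)      # key already present: update in place
            return
        i = (i + 1) % n
    # Fill the first deleted slot we passed over, else the empty one.
    table[first_deleted if first_deleted is not None else i] = (key, value)

def tomb_delete(table, key):
    i = tomb_find(table, key)
    if i is not None:
        table[i] = DELETED               # mark deleted; probing skips past it
```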
The Python hash table implementation (arguably very fast) uses dummy elements to mark deletions. As you grow or shrink your table (assuming you're not using a fixed-size table), you can drop the dummies at the same time.
If you have access to a copy, have a look at the article in Beautiful Code about the implementation.
The best general solutions I can think of include:
If you can use a non-const iterator (à la C++ STL or Java), you should be able to remove items as you encounter them. Presumably, though, you wouldn't be asking this question unless you're using a const iterator or an enumerator that would be invalidated if the underlying collection is modified.
As you said, you could mark a deleted flag within the contained object. This doesn't release any memory or reduce collisions on the key, though, so it's not the best solution. Also requires the addition of a property on the class that probably doesn't really belong there. If this bothers you as much as it would me, or if you simply can't add a flag to the stored object (perhaps you don't control the class), you could store these flags in a separate hash table. This requires the most long-term memory use.
Push the keys of the to-be-removed items into a vector or array list while traversing the hash table. After releasing the enumerator, loop through this secondary list and remove the keys from the hash table (see the sketch after this list). If you have a lot of items to remove and/or the keys are large (which they shouldn't be), this may not be the best solution.
If you're going to end up removing more items from the hash table than you're leaving in, it may be better to create a new hash table, and as you traverse the original, add only the items you're going to keep to the new one. Then replace your reference(s) to the old hash table with the new one. This saves a secondary list iteration, but it's probably only efficient if the new hash table will have significantly fewer items than the original, and it only works if you can change all the references to the original hash table, of course.
If your hash table gives you access to its collection of keys, you may be able to iterate through those and remove items from the hash table in one pass.
If your hash table or some helper in your library provides you with predicate-based collection modifiers, you may have a Remove() function to which you can pass a lambda expression or function pointer to identify the items to remove.
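As an illustration of the collect-then-remove option above, using a Python dict as a stand-in for the hash table and a hypothetical should_remove predicate:

```python
def remove_matching(table, should_remove):
    # First pass: collect keys while iterating (no mutation yet).
    to_remove = [k for k, v in table.items() if should_remove(k, v)]
    # Second pass: the iteration is over, so mutation is now safe.
    for k in to_remove:
        del table[k]
```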
A common technique when time is a factor is to have a second table of deleted items, and clean up the main table when you have time. Commonly used in search engines.
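A sketch of that idea, with a plain dict as the main table and a set as the overlay of pending deletions (names are made up):

```python
class DeferredDeleteTable:
    def __init__(self):
        self.main = {}        # the main table
        self.deleted = set()  # the "second table" of pending deletions

    def put(self, key, value):
        self.main[key] = value
        self.deleted.discard(key)   # a re-insert cancels a pending deletion

    def get(self, key):
        if key in self.deleted:
            raise KeyError(key)     # logically gone, though still in main
        return self.main[key]

    def delete(self, key):
        self.deleted.add(key)       # O(1); the main table is untouched for now

    def compact(self):
        # Clean up the main table "when you have time".
        for key in self.deleted:
            self.main.pop(key, None)
        self.deleted.clear()
```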
How about enhancing the hash table to contain pointers, like a linked list?
When you insert, if the bucket is full, create a pointer from that bucket to the bucket where the new entry is stored.
When deleting something from the hash table, the solution is then equivalent to writing a function that deletes a node from a linked list.
