How does linear probing handle deletions without breaking lookups? - data-structures

Here is my understanding of linear probing.
For insertion:
- We hash to a certain position. If that position already has a value, we linearly increment to the next position, until we encounter an empty position, then we insert there. That makes sense.
My question revolves around lookup. From descriptions I have read, I believe lookup works like this:
We look at the position the item we are looking for hashes to.
If the position is empty, we return Not Found
If the position is full, we move linearly to positions until we either encounter the value we are looking for, or we encounter an empty position (meaning not found)
So how does this work when we delete an item from a hash? Wouldn't that mess up this lookup? Say two items hash to the same position. We add both items, then delete the first one we added. So now, the expected position of the second item (which had to be moved to a different position, since the first item originally occupied it) is empty. Does deleting handle this in some way?

Great question! You are absolutely right that just removing an item from a linear probing table would cause problems in exactly the circumstance that you are reporting.
There are a couple of solutions to this. One is to use tombstone deletion. In tombstone deletion, to remove an element, you replace the element with a marker called a tombstone that indicates "an element used to be here, but has since been removed." Then, when you do a lookup, you use the same procedure as before: jump to the hash location, then keep stepping forward until you find a blank spot. The idea here is that a tombstone doesn't count as a blank spot, so you'll keep scanning over it to find what you're searching for.
To keep the number of tombstones low, there are nice techniques you can use like overwriting tombstones during insertions or globally rebuilding the table if the number of tombstones becomes too great.
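As a rough sketch of the tombstone idea described above (my own illustration; the class name `TombstoneTable` and the `EMPTY`/`TOMBSTONE` sentinels are assumptions made for the example), lookups stop only at a truly blank slot, while insertions are allowed to overwrite a tombstone. Global rebuilding is left out to keep it short.

```python
EMPTY = object()      # slot that has never held an element
TOMBSTONE = object()  # slot whose element was removed


class TombstoneTable:
    def __init__(self, capacity=8):
        self.slots = [EMPTY] * capacity
        self.tombstones = 0

    def _probe(self, key):
        # Visit every slot index once, starting at the key's home slot.
        start = hash(key) % len(self.slots)
        for step in range(len(self.slots)):
            yield (start + step) % len(self.slots)

    def contains(self, key):
        for i in self._probe(key):
            if self.slots[i] is EMPTY:
                return False  # only a truly blank slot ends the search
            if self.slots[i] is not TOMBSTONE and self.slots[i] == key:
                return True   # tombstones are stepped over, not treated as blank
        return False

    def insert(self, key):
        if self.contains(key):
            return
        for i in self._probe(key):
            if self.slots[i] is EMPTY or self.slots[i] is TOMBSTONE:
                if self.slots[i] is TOMBSTONE:
                    self.tombstones -= 1  # the insertion recycles a tombstone
                self.slots[i] = key
                return
        raise RuntimeError("table is full")

    def remove(self, key):
        for i in self._probe(key):
            if self.slots[i] is EMPTY:
                return False
            if self.slots[i] is not TOMBSTONE and self.slots[i] == key:
                self.slots[i] = TOMBSTONE  # leave a marker instead of a blank
                self.tombstones += 1
                return True
        return False
```

With small integer keys and a capacity of 8, keys 1 and 9 share a home slot, so removing 1 leaves a tombstone that the lookup for 9 walks across. A real table would also track the `tombstones` count against some threshold to decide when to rebuild.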
Another option is to use Robin Hood hashing and backward-shift deletion. In Robin Hood hashing, you store the elements in the table in a way that essentially keeps them sorted by their hash code (wraparound at the front of the table makes things a bit more complex than this, but that's the general idea). From there, when doing a deletion, you can shift elements backwards one spot to fill the gap from the removed element until you either hit a blank or an element that's already in the right place and doesn't need to be moved.
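Here is a minimal sketch of Robin Hood insertion with backward-shift deletion, under a few assumptions of mine: `None` marks a blank slot (so `None` cannot be stored as a key), the table always keeps at least one blank slot, and the class and method names are made up for illustration.

```python
class RobinHoodTable:
    def __init__(self, capacity=8):
        self.slots = [None] * capacity  # None marks a blank slot
        self.size = 0

    def _home(self, key):
        return hash(key) % len(self.slots)

    def _distance(self, key, index):
        # How far `index` is from the key's home slot, allowing for wraparound.
        return (index - self._home(key)) % len(self.slots)

    def _find(self, key):
        i, dist = self._home(key), 0
        while self.slots[i] is not None:
            if self.slots[i] == key:
                return i
            if self._distance(self.slots[i], i) < dist:
                return None  # the key would have displaced this resident
            i, dist = (i + 1) % len(self.slots), dist + 1
        return None

    def contains(self, key):
        return self._find(key) is not None

    def insert(self, key):
        if self.contains(key):
            return
        if self.size >= len(self.slots) - 1:
            raise RuntimeError("table is full (one slot is kept blank)")
        i, dist = self._home(key), 0
        while self.slots[i] is not None:
            resident_dist = self._distance(self.slots[i], i)
            if resident_dist < dist:
                # The resident is closer to its home than we are: take its
                # slot and carry the resident forward instead.
                key, self.slots[i] = self.slots[i], key
                dist = resident_dist
            i, dist = (i + 1) % len(self.slots), dist + 1
        self.slots[i] = key
        self.size += 1

    def remove(self, key):
        i = self._find(key)
        if i is None:
            return False
        nxt = (i + 1) % len(self.slots)
        # Shift followers back one spot until we hit a blank or an element
        # that is already sitting in its home slot.
        while (self.slots[nxt] is not None
               and self._distance(self.slots[nxt], nxt) > 0):
            self.slots[i] = self.slots[nxt]
            i, nxt = nxt, (nxt + 1) % len(self.slots)
        self.slots[i] = None
        self.size -= 1
        return True
```

No tombstones are needed here: after `remove`, every remaining element is still reachable from its home slot without crossing a blank.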
For more information on this, check out these lecture slides on linear probing and Robin Hood hashing.
Hope this helps!

Deletion in linear probing (open addressing) is done by assigning a marker such as "Deletion" to the index where the value was removed. [Any value other than None can be stored at that index to indicate that the value there has been deleted.] The snippet below shows how I used a "Deletion" marker to fill the index where a value was deleted:
if self.table[index] == value:
    print("key {} is found in the table and hence deletion tag is updated at that position".format(value))
    self.table[index] = "Deletion"
Now, when a search runs again, this position is not None, so the search continues past it. The snippet below shows how search is implemented with linear probing:
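def search(self, value):
    # Start at the slot the value hashes to.
    index = value % self.table_size
    # Step past occupied slots and "Deletion" markers; stop at the value or at None.
    # Probing at most table_size slots keeps a table with no None slot from looping forever.
    for _ in range(self.table_size):
        if self.table[index] is None or self.table[index] == value:
            break
        index = (index + 1) % self.table_size
    if self.table[index] == value:
        print("Key is found in the table")
    else:
        print("key is not found in the table")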
One can also look at the GitHub code, which shows deletion in linear probing without breaking lookups.

Related

Will key in the index be removed after deletion in B Plus tree?

I'm a little confused about deletion in a B+ tree. I searched a lot on Google and found that there are two implementations when the key you want to delete appears in the index:
Delete the key in the index
Keep the key in the index
Algorithm from https://www.javatpoint.com/b-plus-tree-deletion uses the first way.
Algorithm from https://www.cs.princeton.edu/courses/archive/fall08/cos597A/Notes/BplusInsertDelete.pdf uses the second way.
So I really want to know which one is right.
But I'm more inclined to treat that as undefined behavior. At this point, could someone help me figure out the advantages and disadvantages of each? And how do I choose between them?
Thanks in advance.
Both methods are correct.
The difference that you highlight is not so much in deleting/not-deleting internal keys, but in updating/not-updating them.
Obviously, when you delete a value (i.e. a key in a leaf node), the b-plus-tree property is not violated: all child values are still within the range dictated by the parent information. You can never break this range-rule by merely removing a value from a leaf. This rule is also still valid when you update the internal key(s) in the path to that leaf (according to method 1), which is only necessary when the deleted value was the left-most one in its leaf.
Note that the two methods may produce quite different trees after a long sequence of the same operations (insert, delete).
But on average the second method will have slightly less work to do. This difference is not significant though.

Is there a specific scenario of a hash table that isn't full yet an insertion can't occur?

What I mean to ask is for a hash-table following the standard size of a prime number, is it possible to have some scenario (of inserted keys) where no further insertion of a given element is possible even though there's some empty slots? What kind of hash-function would achieve that?
So, most hash functions allow for collisions ("Hash Collisions" is the phrase you should google to understand this better, by the way.) Collisions are handled by having a secondary data structure, like a list, to store all of the values inserted at keys with the same hash.
Because these data structures can generally store arbitrarily many elements, you will always be able to insert into the hash table, but the performance will get worse and worse, approaching the performance of the backing data structure.
If you do not have a backing data structure, then you can be unable to insert as soon as two things get added to the same position. Since a good hash function distributes things evenly and effectively randomly, this would happen pretty quickly (see "The Birthday Problem").
There are failure-to-insert scenarios for some but not all hash table implementations.
For example, closed hashing aka open addressing implementations use some logic to create a sequence of buckets in which they'll "probe" for values not found at the hashed-to bucket due to collisions. In the real world, sometimes the sequence-creation is pretty basic, for example:
the programmer might have hard-coded N prime numbers, thinking the odds of adding in each of those in turn and still not finding an empty bucket are low (but a malicious user who knows the hash table design may be able to calculate values to make the table fail, or it may simply be so full that the odds are no longer good, or - while emptier - a statistical freak event)
the programmer might have done something like picked a prime number they liked - say 13903 - to add to the last-probed bucket each time until a free one is found, but if the table size happens to be 13903 too it'll keep checking the same bucket.
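To make the second failure mode concrete, here is a tiny sketch (the helper `probe_sequence` is a name I made up) that lists the buckets visited when a fixed step is added to the last-probed bucket each time. When the step equals the table size, every probe lands on the same bucket.

```python
def probe_sequence(home, step, table_size, attempts):
    """Buckets visited when `step` is added to the last-probed bucket each time."""
    return [(home + k * step) % table_size for k in range(attempts)]


# Step equal to the table size: the same bucket is checked over and over.
stuck = probe_sequence(home=42, step=13903, table_size=13903, attempts=5)

# Step of 1 (linear probing): a different bucket on every attempt.
linear = probe_sequence(home=42, step=1, table_size=13903, attempts=5)
```

Here `stuck` is `[42, 42, 42, 42, 42]`, so an insertion fails as soon as bucket 42 is taken, no matter how empty the rest of the table is.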
Still, there are probing approaches such as linear probing that guarantee to try all buckets (unless the implementation goes out of its way to put a limit on retries). It has some other "issues" though, and won't always be the best choice.
If a hash table is implemented using open addressing instead of separate chaining, then it is a good idea to leave at least 1 slot empty to simplify the algorithm.
In open addressing when we are trying to find an element, we first compute the hash index i, then check the table at indexes {i, i + 1, i + 2, ... N - 1, (wrapping around) 0, 1, 2, ...}, until we either find the element we want or hit an empty slot. You can see that in this algorithm, if no slot is empty but the element can't be found, then the search would loop forever.
However, I should emphasize that enforcing this rule (keeping at least one slot empty) merely simplifies the search algorithm. Alternatively, the search algorithm can remember the starting index i and halt once the entire table has been scanned and it lands back at index i.
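A minimal sketch of that alternative, assuming `None` marks an empty slot and the table is a plain Python list (`find_slot` is a hypothetical helper name):

```python
def find_slot(table, key):
    """Return the index holding `key`, or None if it is absent."""
    n = len(table)
    start = hash(key) % n
    i = start
    while True:
        if table[i] is None:
            return None      # an empty slot ends the search
        if table[i] == key:
            return i
        i = (i + 1) % n
        if i == start:
            return None      # scanned the whole table and came back around
```

With this check the search terminates even on a completely full table, at the cost of one extra comparison per probe.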

Why does a hash table with linear probing need a “no object” value or a parallel array of Boolean?

Why does a hash table with linear probing need a “no object” value or a parallel array of Boolean?
Give an example of the problem that can happen if we have neither of these techniques. Which technique is more space efficient? Why?
Ask yourself this: what happens when we delete an item from a hash table? Suppose inserting an element caused a collision, so we had to linearly probe to find a free spot for it. If we later delete the original item and don't leave a marker, we'll never be able to find the new item again.
As to which one is more efficient, generally leaving a "no object" value is best, because the space has to be taken up anyway, so we might as well use it for something rather than allocating a whole new array to keep track of which slots in the hash table are unused.
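A small demonstration of the failure, assuming integer keys, `key % n` as a stand-in hash function, and `None` as the only "empty" value (so a deleted slot is indistinguishable from one that was never used):

```python
def lookup(table, key):
    """Linear-probing lookup that stops at the first empty (None) slot."""
    n = len(table)
    i = key % n
    for _ in range(n):
        if table[i] is None:
            return False
        if table[i] == key:
            return True
        i = (i + 1) % n
    return False


table = [None] * 8
table[1] = 1   # key 1 lands in its home slot
table[2] = 9   # key 9 also hashes to slot 1, so it overflowed into slot 2

found_before = lookup(table, 9)   # True: the probe walks from slot 1 to slot 2
table[1] = None                   # naive deletion of key 1, no marker left
found_after = lookup(table, 9)    # False: the probe stops at the blank slot 1
```

Key 9 is still stored in slot 2, but nothing can reach it any more.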

Search based on Second value in a map

I have a mapping of String id -> Object. Apart from merely having to insert and delete into this map, I would also need to find the id with the lowest x-value (x-value is a member in the Class from which the Object is instantiated).
Initially I thought I could just create another mapping x-value -> String id for this. But that does not help this much, because in case of Remove operation, I have to now anyway search this second map for a particular id (so we are back to the main problem itself now).
Any suggestions to do this efficiently? (time wise - memory is not a big constraint)
EDIT: I think I could just get the x-value from the id (for removal function) and remove from second map using the x-value. Another thing here - the x-value is a float. Good idea to use float as a key in a map ?? Maybe using fabs and a precision value could do the trick here for floating point comparisons ?
EDIT #2: Unfortunately I remembered why the above method might not work (I was busy with other stuff and forgot about this project for a while). The x-value for different map entries NEED NOT BE UNIQUE. String ID is the primary key. So I need to use a multimap and use equal_range.
Your solution of using an auxiliary map isn't as bad as your post suggests.
It is true that a removal operation would require a lookup in the second map. However, this lookup can be done in O(log n) time. This is unlikely to be a deal breaker. If it is, please post more details.
How often do you remove objects? In cases like this you usually have to think about the frequency of operations too. If removal is done infrequently, then your solution with the second map could be quite good.
If you use a tree map for the second mapping, you will immediately have the minimum element, and it will take O(log n) to remove an element from it.
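Here is a rough sketch of the two-map idea. It is only an approximation of a tree map: Python's standard library has no balanced tree, so I keep a sorted list of `(x, id)` pairs with `bisect`, which finds positions in O(log n) but shifts list elements in O(n). Storing pairs rather than bare x-values copes with x-values that are not unique. The class name `MinXIndex` is made up, and the primary map is simplified to id -> x instead of id -> Object.

```python
import bisect


class MinXIndex:
    def __init__(self):
        self.x_by_id = {}       # primary map: id -> x-value
        self.sorted_pairs = []  # secondary index: (x, id) in ascending order

    def insert(self, item_id, x):
        if item_id in self.x_by_id:
            self.remove(item_id)
        self.x_by_id[item_id] = x
        bisect.insort(self.sorted_pairs, (x, item_id))

    def remove(self, item_id):
        # Look up x through the primary map, then locate the exact pair.
        x = self.x_by_id.pop(item_id)
        pos = bisect.bisect_left(self.sorted_pairs, (x, item_id))
        del self.sorted_pairs[pos]

    def min_id(self):
        # The pair with the lowest x is always at the front.
        return self.sorted_pairs[0][1] if self.sorted_pairs else None
```

Because the removal looks up the exact float that was stored, no tolerance-based comparison is needed; this does assume the x-values are never NaN.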
One other alternative is to use a priority queue backed by a doubly linked list to find the minimal element, and in the first map keep a direct reference to the element's node. This node can be used for removal.

Best way to remove an entry from a hash table

What is the best way to remove an entry from a hashtable that uses linear probing? One way to do this would be to use a flag to indicate deleted elements? Are there any ways better than this?
An easy technique is to:
1. Find and remove the desired element.
2. Go to the next bucket.
3. If the bucket is empty, quit.
4. If the bucket is full, delete the element in that bucket and re-add it to the hash table using the normal means. The item must be removed before re-adding, because it is likely that the item could be added back into its original spot.
5. Repeat step 2.
This technique keeps your table tidy at the expense of slightly slower deletions.
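The steps above can be sketched roughly as follows, assuming the table is a Python list in which `None` marks an empty bucket (the function names are mine):

```python
def insert(table, key):
    """Plain linear-probing insert; None marks an empty bucket."""
    n = len(table)
    i = hash(key) % n
    for _ in range(n):
        if table[i] is None or table[i] == key:
            table[i] = key
            return
        i = (i + 1) % n
    raise RuntimeError("table is full")


def remove(table, key):
    n = len(table)
    i = hash(key) % n
    for _ in range(n):
        if table[i] is None:
            return False              # hit an empty bucket: key is absent
        if table[i] == key:
            break
        i = (i + 1) % n
    else:
        return False                  # scanned the whole table without a match

    table[i] = None                   # remove the desired element
    i = (i + 1) % n                   # go to the next bucket
    while table[i] is not None:       # an empty bucket means we are done
        moved, table[i] = table[i], None  # take the element out first...
        insert(table, moved)              # ...then re-add it the normal way
        i = (i + 1) % n
    return True
```

Each re-added element either returns to the bucket it just left or moves closer to its home bucket, so no marker is needed afterwards.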
It depends on how you handle overflow and whether (1) the item being removed is in an overflow slot or not, and (2) if there are overflow items beyond the item being removed, whether they have the hash key of the item being removed or possibly some other hash key. [Overlooking that double condition is a common source of bugs in deletion implementations.]
If collisions overflow into a linked list, it is pretty easy. You're either popping up the list (which may have gone empty) or deleting a member from the middle or end of the linked list. Those are fun and not particularly difficult. There can be other optimizations to avoid excessive memory allocations and freeings to make this even more efficient.
For linear probing, Knuth suggests that a simple approach is to have a way to mark a slot as empty, deleted, or occupied. Mark a removed occupant slot as deleted so that overflow by linear probing will skip past it, but if an insertion is needed, you can fill the first deleted slot that you passed over [The Art of Computer Programming, vol.3: Sorting and Searching, section 6.4 Hashing, p. 533 (ed.2)]. This assumes that deletions are rather rare.
Knuth gives a nice refinement as Algorithm R6.4 [pp. 533-534] that instead marks the cell as empty rather than deleted, and then finds ways to move table entries back closer to their initial-probe location by moving the hole that was just made until it ends up next to another hole.
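The following is a sketch in the spirit of that hole-moving refinement, not a transcription of Knuth's algorithm. It assumes probes step forward through a Python list, `None` marks an empty cell, and the function name is my own. An entry is moved into the hole only when the probe path from its home slot passes through the hole; otherwise it stays where it is.

```python
def remove_closing_gap(table, key):
    n = len(table)

    def home(k):
        return hash(k) % n

    # Locate the key with an ordinary linear probe.
    i = home(key)
    for _ in range(n):
        if table[i] is None:
            return False
        if table[i] == key:
            break
        i = (i + 1) % n
    else:
        return False

    hole = i
    table[hole] = None                 # mark the cell empty, not deleted
    j = (hole + 1) % n
    while table[j] is not None:        # stop once the hole sits next to another hole
        r = home(table[j])
        # Forward distance from the entry's home to j versus from the hole to j:
        # if the first is at least the second, a lookup would cross the hole.
        if (j - r) % n >= (j - hole) % n:
            table[hole] = table[j]     # move the entry back into the hole
            table[j] = None
            hole = j                   # the hole is now where the entry was
        j = (j + 1) % n
    return True
```

As the caveat in the next paragraph says, this relocates entries that are still in use, so it only suits tables where nobody outside holds on to slot positions.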
Knuth cautions that this will move existing still-occupied slot entries and is not a good idea if pointers to the slots are being held onto outside of the hash table. [If you have garbage-collected- or other managed-references in the slots, it is all right to move the slot, since it is the reference that is being used outside of the table and it doesn't matter where the slot that references the same object is in the table.]
The Python hash table implementation (arguably very fast) uses dummy elements to mark deletions. As you grow or shrink the table (assuming you're not using a fixed-size table), you can drop the dummies at the same time.
If you have access to a copy, have a look at the article in Beautiful Code about the implementation.
The best general solutions I can think of include:
If you can use a non-const iterator (à la C++ STL or Java), you should be able to remove them as you encounter them. Presumably, though, you wouldn't be asking this question unless you're using a const iterator or an enumerator which would be invalidated if the underlying collection is modified.
As you said, you could mark a deleted flag within the contained object. This doesn't release any memory or reduce collisions on the key, though, so it's not the best solution. Also requires the addition of a property on the class that probably doesn't really belong there. If this bothers you as much as it would me, or if you simply can't add a flag to the stored object (perhaps you don't control the class), you could store these flags in a separate hash table. This requires the most long-term memory use.
Push the keys of the to-be-removed items into a vector or array list while traversing the hash table. After releasing the enumerator, loop through this secondary list and remove the keys from the hash table. If you have a lot of items to remove and/or the keys are large (which they shouldn't be), this may not be the best solution.
If you're going to end up removing more items from the hash table than you're leaving in there, it may be better to create a new hash table, and as you traverse your original one, add to the new hash table only the items you're going to keep. Then replace your reference(s) to the old hash table with the new one. This saves a secondary list iteration, but it's probably only efficient if the new hash table will have significantly fewer items than the original one, and it definitely only works if you can change all the references to the original hash table, of course.
If your hash table gives you access to its collection of keys, you may be able to iterate through those and remove items from the hash table in one pass.
If your hash table or some helper in your library provides you with predicate-based collection modifiers, you may have a Remove() function to which you can pass a lambda expression or function pointer to identify the items to remove.
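Two of the options above, sketched with a Python dict standing in for the hash table (a dict raises a RuntimeError if its size changes while you iterate over it, which is the same kind of invalidation). The function names and the `should_remove` predicate are made up for the example.

```python
def remove_in_two_passes(table, should_remove):
    """Collect the doomed keys while traversing, then delete them afterwards."""
    doomed = [key for key, value in table.items() if should_remove(key, value)]
    for key in doomed:
        del table[key]


def rebuild_without(table, should_remove):
    """Build a new table that holds only the entries worth keeping."""
    return {key: value for key, value in table.items()
            if not should_remove(key, value)}


stock = {"apple": 0, "pear": 3, "plum": 0, "fig": 7}

def is_sold_out(name, count):
    return count == 0

kept = rebuild_without(stock, is_sold_out)   # new table; `stock` is unchanged
remove_in_two_passes(stock, is_sold_out)     # modifies `stock` in place
```

The first function fits when only a few entries go away; the second tends to make more sense when most of the table is being discarded and every reference to the old table can be swapped for the new one.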
A common technique when time is a factor is to have a second table of deleted items, and clean up the main table when you have time. Commonly used in search engines.
How about enhancing the hash table to contain pointers like a linked list?
When you insert, if the bucket is full, create a pointer from this bucket to the bucket where the new field is stored.
While deleting something from the hashtable, the solution will be equivalent to how you write a function to delete a node from linkedlist.

Resources