Will a key in the index be removed after deletion in a B+ tree? - algorithm

I'm a little confused about deletion in a B+ tree. I searched a lot on Google and found that there are two implementations for the case where the key you want to delete also appears in the index:
Delete the key in the index
Keep the key in the index
Algorithm from https://www.javatpoint.com/b-plus-tree-deletion uses the first way.
Algorithm from https://www.cs.princeton.edu/courses/archive/fall08/cos597A/Notes/BplusInsertDelete.pdf uses the second way.
So I really want to know which one is right.
But I'm more inclined to treat it as implementation-defined behavior. In that case, could someone help me figure out the advantages and disadvantages of each, and how to choose between them?
Thanks in advance.

Both methods are correct.
The difference that you highlight is not so much in deleting/not-deleting internal keys, but in updating/not-updating them.
Obviously, when you delete a value (i.e. a key in a leaf node), the B+ tree property is not violated: all child values are still within the range dictated by the parent information. You can never break this range rule by merely removing a value from a leaf. The rule also still holds when you update the internal key(s) on the path to that leaf (as in method 1), which is only necessary when the deleted value was the left-most one in its leaf.
Note that the two methods may produce quite different trees after a long sequence of the same operations (insert, delete).
But on average the second method will have slightly less work to do. This difference is not significant though.
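To make the contrast concrete, here is a minimal sketch in C++ of the two deletion styles for a simplified tree. It is not a full B+ tree: the Node layout is an assumption for illustration, keys are unique, and leaf underflow, merging and leaf links are ignored entirely.

    #include <algorithm>
    #include <vector>

    struct Node {
        bool leaf = false;
        std::vector<int> keys;        // separators (internal node) or values (leaf)
        std::vector<Node*> children;  // convention: keys[i] is the smallest key
                                      // reachable through children[i + 1]
    };

    // Descend to the leaf that should contain `key`, remembering the internal path.
    Node* findLeaf(Node* root, int key, std::vector<Node*>& path) {
        Node* n = root;
        while (!n->leaf) {
            path.push_back(n);
            auto i = std::upper_bound(n->keys.begin(), n->keys.end(), key) - n->keys.begin();
            n = n->children[i];
        }
        return n;
    }

    // Method 2 (keep the key in the index): just remove the value from the leaf;
    // a stale separator is still a valid bound, so nothing else has to change.
    void eraseKeepSeparator(Node* root, int key) {
        std::vector<Node*> path;
        Node* leaf = findLeaf(root, key, path);
        leaf->keys.erase(std::remove(leaf->keys.begin(), leaf->keys.end(), key),
                         leaf->keys.end());
    }

    // Method 1 (update the key in the index): if the removed value was the
    // left-most one in its leaf, also refresh the matching separator on the path.
    void eraseUpdateSeparator(Node* root, int key) {
        std::vector<Node*> path;
        Node* leaf = findLeaf(root, key, path);
        bool wasLeftmost = !leaf->keys.empty() && leaf->keys.front() == key;
        leaf->keys.erase(std::remove(leaf->keys.begin(), leaf->keys.end(), key),
                         leaf->keys.end());
        if (wasLeftmost && !leaf->keys.empty()) {
            for (Node* n : path) {
                auto it = std::find(n->keys.begin(), n->keys.end(), key);
                if (it != n->keys.end()) *it = leaf->keys.front();  // refresh the separator
            }
        }
    }

Either routine leaves a tree that satisfies the range rule; method 1 just pays the extra cost of touching the path again whenever the left-most value of a leaf disappears.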

Related

Chess: Extracting the principal variation from the transposition table

Earlier, I was having an issue involving my principal variation becoming truncated by an alpha-beta search. Indeed, this appears to be a common issue. From the authors of Crafty:
Another solution with even worse properties is to extract the full PV from the transposition table, and avoid using the triangular array completely. If the transposition table is large enough so that nothing gets overwritten at all, this would almost work. But there is a hidden “gotcha”. Once you search the PV, you continue to search other parts of the tree, and you will frequently encounter some of those same positions again, but the hash draft will not be deep enough to allow the entry to be used. You continue searching and now have to overwrite that entry (exact hash signature match requires this) and you now leave a transposition “trail” to a different endpoint, one which does not match the original score. Or any one of the trans/ref entries between the root and the actual PV endpoint get overwritten by a different position completely, and you now have no way to search the transposition table to find the actual endpoint you want to display. Several use this approach to produce the PV, and the complaints are generally frequent and loud, because an evaluation paired with an unrelated PV is not very useful, either for debugging or for analyzing your own games.
This makes a lot of sense.
Consider that the principal variation is ABCDEF, but AB returns the board to its original position. Then, an alternative line examined later might be CDEFGH, which results in a different evaluation than the earlier search of just CDEF. Thus, the transposition table entry for the board state after AB is overwritten, potentially with a node that will be cut off by alpha-beta (!!), and the PV of ABCDEF is destroyed forever.
Is there any way around this problem, or do I have to use an external data structure to save the PV?
Specifically, what is wrong with replacing if and only if the new entry is deeper and exact? This doesn't seem to work, but I'm not sure why.
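For reference, here is a rough C++ sketch of the collect-the-PV-during-search approach that the Crafty quote contrasts with reading the PV back out of the transposition table (the triangular array is one implementation of the same idea; this sketch uses a per-node vector instead). Because the PV is spliced together on the way back up the search, later table overwrites cannot destroy it. Everything chess-specific here (Move, generateMoves(), evaluate(), makeMove(), undoMove()) is a stub and an assumption, not a real engine.

    #include <vector>

    struct Move { int id = 0; };               // placeholder move type
    struct Line { std::vector<Move> moves; };  // best continuation found from a node

    // Stubs so the sketch is self-contained; a real engine supplies these.
    std::vector<Move> generateMoves() { return {}; }
    int evaluate() { return 0; }
    void makeMove(const Move&) {}
    void undoMove(const Move&) {}

    int alphaBeta(int alpha, int beta, int depth, Line& pv) {
        if (depth == 0) { pv.moves.clear(); return evaluate(); }
        for (const Move& m : generateMoves()) {
            Line childPv;                                    // fresh line for this move
            makeMove(m);
            int score = -alphaBeta(-beta, -alpha, depth - 1, childPv);
            undoMove(m);
            if (score >= beta) return beta;                  // cutoff: this node's line is irrelevant
            if (score > alpha) {
                alpha = score;
                pv.moves.clear();                            // splice: m followed by the child's best line
                pv.moves.push_back(m);
                pv.moves.insert(pv.moves.end(), childPv.moves.begin(), childPv.moves.end());
            }
        }
        return alpha;
    }

The PV for the whole search is whatever ends up in the Line passed to the root call; it never depends on transposition-table entries surviving.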

Is there a specific scenario where a hash table isn't full, yet an insertion can't occur?

What I mean to ask is: for a hash table whose size is, as is standard, a prime number, is it possible to have some scenario (of inserted keys) where no further insertion of a given element is possible even though there are still empty slots? What kind of hash function would cause that?
So, most hash functions allow for collisions ("hash collisions" is the phrase you should google to understand this better, by the way). Collisions are handled by having a secondary data structure, like a list, that stores all of the values inserted at keys with the same hash.
Because these data structures can generally store arbitrarily many elements, you will always be able to insert into the hash table, but the performance will get worse and worse, approaching the performance of the backing data structure.
If you do not have a backing data structure, then you can be unable to insert as soon as two things get added to the same position. Since a good hash function distributes things evenly and effectively randomly, this would happen pretty quickly (see "The Birthday Problem").
There are failure-to-insert scenarios for some but not all hash table implementations.
For example, closed hashing aka open addressing implementations use some logic to create a sequence of buckets in which they'll "probe" for values not found at the hashed-to bucket due to collisions. In the real world, sometimes the sequence-creation is pretty basic, for example:
the programmer might have hard-coded N prime numbers, thinking the odds of adding each of those in turn and still not finding an empty bucket are low (but a malicious user who knows the hash table design may be able to calculate values that make the table fail, or the table may simply be so full that the odds are no longer good, or, even while emptier, a statistical freak event could defeat it)
the programmer might have done something like pick a prime number they liked - say 13903 - and add it to the last-probed bucket index each time until a free bucket is found; but if the table size happens to be 13903 too, it will keep checking the same bucket.
Still, there are probing approaches such as linear probing that guarantee to try all buckets (unless the implementation goes out of its way to put a limit on retries). It has some other "issues" though, and won't always be the best choice.
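To quantify the fixed-stride scenario above, here is a small C++ sketch (the function name and numbers other than 13903 are mine) that counts how many distinct buckets such a probe sequence can ever reach: it is the table size divided by gcd(table size, stride), which collapses to a single bucket when the stride equals the table size.

    #include <iostream>
    #include <set>

    // Follow a fixed-stride probe sequence from bucket 0 until it repeats,
    // and count how many distinct buckets it visited along the way.
    int bucketsVisited(int tableSize, int stride) {
        std::set<int> seen;
        int b = 0;
        do {
            seen.insert(b);
            b = (b + stride) % tableSize;
        } while (b != 0);
        return static_cast<int>(seen.size());
    }

    int main() {
        std::cout << bucketsVisited(13903, 13903) << '\n';  // 1     -- keeps hitting one bucket
        std::cout << bucketsVisited(13903, 7)     << '\n';  // 13903 -- full coverage (gcd is 1)
        std::cout << bucketsVisited(10, 4)        << '\n';  // 5     -- gcd(10, 4) = 2 halves coverage
    }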
If a hash table is implemented using open addressing instead of separate chaining, then it is a good idea to leave at least 1 slot empty to simplify the algorithm.
In open addressing when we are trying to find an element, we first compute the hash index i, then check the table at indexes {i, i + 1, i + 2, ... N - 1, (wrapping around) 0, 1, 2, ...}, until we either find the element we want or hit an empty slot. You can see that in this algorithm, if no slot is empty but the element can't be found, then the search would loop forever.
However, I should emphasize that enforcing the empty-slot rule merely simplifies the search algorithm. Alternatively, the search algorithm can remember the starting index i and halt the search once the entire table has been scanned and it lands back at index i.
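Here is a minimal C++ sketch of the lookup loop just described, stopping either at an empty slot or when the scan wraps back to the remembered starting index. The Slot layout and the choice of hash are assumptions for illustration.

    #include <functional>
    #include <optional>
    #include <string>
    #include <vector>

    struct Slot {
        bool occupied = false;
        std::string key;
        int value = 0;
    };

    std::optional<int> find(const std::vector<Slot>& table, const std::string& key) {
        const std::size_t n = table.size();
        std::size_t start = std::hash<std::string>{}(key) % n;
        std::size_t i = start;
        do {
            if (!table[i].occupied) return std::nullopt;  // empty slot: key cannot be further along
            if (table[i].key == key) return table[i].value;
            i = (i + 1) % n;                              // linear probe to the next slot
        } while (i != start);                             // remembering `start` stops a full-table scan
        return std::nullopt;
    }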

Search based on Second value in a map

I have a mapping of String id -> Object. Apart from merely having to insert into and delete from this map, I would also need to find the id with the lowest x-value (the x-value is a member of the class from which the Object is instantiated).
Initially I thought I could just create another mapping x-value -> String id for this. But that does not help much, because for a remove operation I now have to search this second map for a particular id anyway (so we are back to the main problem).
Any suggestions to do this efficiently? (time wise - memory is not a big constraint)
EDIT: I think I could just get the x-value from the id (for the removal function) and remove from the second map using that x-value. Another thing here - the x-value is a float. Is it a good idea to use a float as a key in a map? Maybe using fabs and a precision value could do the trick for the floating-point comparisons?
EDIT #2: Unfortunately I remembered why the above method might not work (I was busy with other stuff and forgot about this project for a while). The x-value for different map entries NEED NOT BE UNIQUE. String ID is the primary key. So I need to use a multimap and use equal_range.
Your solution of using an auxiliary map isn't as bad as your post suggests.
It is true that a removal operation would require a lookup in the second map. However, this lookup can be done in O(log n) time. This is unlikely to be a deal breaker. If it is, please post more details.
How often do you remove objects? Usually in cases like this you have to think about the frequency of operations too. If removal is done infrequently, then your solution with the second map could be quite good.
If you use a tree map for the second mapping, you will immediately have the minimum element, and it will take O(log n) to remove an element from it.
Another alternative is to use a priority queue backed by a doubly linked list to find the minimal element, and in the first map remember a direct reference to the element's node. This node can then be used for removal.
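As a concrete C++ sketch of the two-map idea (the Object and Store names and members are placeholders, and insert is assumed to be called at most once per id): the secondary index is a multimap because x-values need not be unique, removal narrows the secondary lookup with equal_range using the x-value stored in the primary map, and the minimum is simply the first entry of the index.

    #include <map>
    #include <string>

    struct Object { double xValue = 0; /* ...other members... */ };

    struct Store {
        std::map<std::string, Object> byId;       // primary: id -> object
        std::multimap<double, std::string> byX;   // secondary: x-value -> id (x not unique)

        void insert(const std::string& id, const Object& o) {
            byId[id] = o;
            byX.emplace(o.xValue, id);
        }

        void remove(const std::string& id) {
            auto it = byId.find(id);
            if (it == byId.end()) return;
            // Use the x-value stored in the primary map to narrow the secondary
            // lookup, then erase the entry whose id matches among the equal keys.
            auto range = byX.equal_range(it->second.xValue);
            for (auto r = range.first; r != range.second; ++r) {
                if (r->second == id) { byX.erase(r); break; }
            }
            byId.erase(it);
        }

        // The id with the lowest x-value is simply the first entry of the index.
        const std::string* lowestX() const {
            return byX.empty() ? nullptr : &byX.begin()->second;
        }
    };

Note that the floating-point worry largely disappears here: the value passed to equal_range is the exact same number that was stored, so no epsilon comparison is needed.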

Efficient mass modification of persistent data structures

I understand how trees are typically used to modify persistent data structures (create a new node and replace all of its ancestors).
But what if I have a tree of tens of thousands of nodes and I need to modify thousands of them? I don't want to go through and create thousands of new roots; I only need the one new root that results from modifying everything at once.
For example:
Let's take a persistent binary tree as an example. In the single-node update case, it does a search until it finds the node, creates a new one with the modifications and the old children, and creates new ancestors up to the root.
In the bulk update case could we do:
Instead of just updating a single node, you're going to update 1000 nodes in one pass.
At the root node, the current list is the full list. You then split that list between those that match the left node and those that match the right. If none match one of the children, don't descend to it. You then descend to the left node (assuming there were matches), split its search list between its children, and continue. When you have a single node and a match, you update it and go back up, replacing and updating ancestors and other branches as appropriate.
This would result in only one new root even though it modified any number of nodes.
These kinds of "mass modification" operations are sometimes called bulk updates. Of course, the details will vary depending on exactly what kind of data structure you are working with and what kind of modifications you are trying to perform.
Typical kinds of operations might include "delete all values satisfying some condition" or "increment the values associated with all the keys in this list". Frequently, these operations can be performed in a single walk over the entire structure, taking O(n) time.
You seem to be concerned about the memory allocation involved in creating "1000's of new roots". Typical allocation for performing the operations one at a time would be O(k log n), where k is the number of nodes being modified. Typical allocation for performing the single walk over the entire structure would be O(n). Which is better depends on k and n.
In some cases, you can decrease the amount of allocation--at the cost of more complicated code--by paying special attention to when changes occur. For example, if you have a recursive algorithm that returns a tree, you might modify the algorithm to return a tree together with a boolean indicating whether anything has changed. The algorithm could then check those booleans before allocating a new node to see whether the old node can safely be reused. However, people don't usually bother with this extra check unless and until they have evidence that the extra memory allocation is actually a problem.
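Here is a C++ sketch of that idea on a persistent binary search tree: one walk applies all the modifications, unchanged subtrees are shared rather than copied, and the changed flag decides whether a node needs to be reallocated at all. The node layout and the particular operation ("add delta to the values of these keys") are illustrative assumptions.

    #include <memory>
    #include <set>

    struct Node {
        int key;
        int value;
        std::shared_ptr<const Node> left, right;
    };
    using Tree = std::shared_ptr<const Node>;

    // One pass over the whole tree: add `delta` to the value of every key in
    // `targets`, reuse (share) any subtree that did not change, and report via
    // `changed` whether a new node had to be allocated at this level.
    Tree bulkAdd(const Tree& t, const std::set<int>& targets, int delta, bool& changed) {
        changed = false;
        if (!t) return t;
        bool leftChanged = false, rightChanged = false;
        bool here = targets.count(t->key) > 0;
        Tree l = bulkAdd(t->left, targets, delta, leftChanged);
        Tree r = bulkAdd(t->right, targets, delta, rightChanged);
        if (!leftChanged && !rightChanged && !here)
            return t;                                  // nothing changed below: share the old node
        changed = true;
        return std::make_shared<const Node>(
            Node{t->key, here ? t->value + delta : t->value, l, r});
    }

A caller passes the old root plus the set of keys to change and gets back a single new root (or the old root unchanged) that shares every untouched subtree with the old tree.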
A particular implementation of what you're looking for can be found in Clojure's (and ClojureScript's) transients.
In short, given a fully-immutable, persistent data structure, a transient version of it will make changes using destructive (allocation-efficient) mutation, which you can flip back into a proper persistent data structure again when you're done with your performance-sensitive operations. It is only at the transition back to a persistent data structure that new roots are created (for example), thus amortizing the attendant cost over the number of logical operations you performed on the structure while it was in its transient form.

Best way to remove an entry from a hash table

What is the best way to remove an entry from a hash table that uses linear probing? One way to do this would be to use a flag to indicate deleted elements. Are there any better ways than this?
An easy technique is to:
1. Find and remove the desired element.
2. Go to the next bucket.
3. If the bucket is empty, quit.
4. If the bucket is full, delete the element in that bucket and re-add it to the hash table using the normal means. The item must be removed before re-adding, because it is likely that the item could be added back into its original spot.
5. Repeat step 2.
This technique keeps your table tidy at the expense of slightly slower deletions.
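Here is a C++ sketch of that technique for a string-keyed table; the Slot layout is an assumption, and the table is assumed never to be completely full.

    #include <functional>
    #include <string>
    #include <vector>

    struct Slot {
        bool occupied = false;
        std::string key;
        int value = 0;
    };

    std::size_t probeStart(const std::vector<Slot>& t, const std::string& key) {
        return std::hash<std::string>{}(key) % t.size();
    }

    void insert(std::vector<Slot>& t, const std::string& key, int value) {
        std::size_t i = probeStart(t, key);
        while (t[i].occupied && t[i].key != key) i = (i + 1) % t.size();
        t[i] = {true, key, value};
    }

    void erase(std::vector<Slot>& t, const std::string& key) {
        std::size_t i = probeStart(t, key);
        while (t[i].occupied && t[i].key != key) i = (i + 1) % t.size();
        if (!t[i].occupied) return;                  // key not present
        t[i].occupied = false;                       // step 1: find and remove the element
        std::size_t j = (i + 1) % t.size();
        while (t[j].occupied) {                      // steps 2-5: walk the rest of the cluster
            Slot moved = t[j];
            t[j].occupied = false;                   // remove before re-adding (step 4)
            insert(t, moved.key, moved.value);
            j = (j + 1) % t.size();
        }
    }

Note how erase() clears each displaced slot before re-inserting it, for exactly the reason given in step 4.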
It depends on how you handle overflow and whether (1) the item being removed is in an overflow slot or not, and (2) if there are overflow items beyond the item being removed, whether they have the hash key of the item being removed or possibly some other hash key. [Overlooking that double condition is a common source of bugs in deletion implementations.]
If collisions overflow into a linked list, it is pretty easy. You're either popping the head of the list (which may leave it empty) or deleting a member from the middle or end of the linked list. Those are fun and not particularly difficult. There can be other optimizations to avoid excessive memory allocations and freeings to make this even more efficient.
For linear probing, Knuth suggests that a simple approach is to have a way to mark a slot as empty, deleted, or occupied. Mark a removed occupant slot as deleted so that overflow by linear probing will skip past it, but if an insertion is needed, you can fill the first deleted slot that you passed over [The Art of Computer Programming, vol.3: Sorting and Searching, section 6.4 Hashing, p. 533 (ed.2)]. This assumes that deletions are rather rare.
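For comparison, here is a C++ sketch of the empty/deleted/occupied marking just described; the layout and names are assumptions, and resizing and tombstone cleanup are omitted. Lookups probe past Deleted slots, while insertions may reuse the first Deleted slot they passed over.

    #include <functional>
    #include <string>
    #include <vector>

    enum class State { Empty, Deleted, Occupied };

    struct Slot {
        State state = State::Empty;
        std::string key;
        int value = 0;
    };

    bool find(const std::vector<Slot>& t, const std::string& key, int& out) {
        std::size_t i = std::hash<std::string>{}(key) % t.size();
        while (t[i].state != State::Empty) {                 // probing skips past Deleted slots
            if (t[i].state == State::Occupied && t[i].key == key) { out = t[i].value; return true; }
            i = (i + 1) % t.size();
        }
        return false;                                        // stopped at a truly empty slot
    }

    void insert(std::vector<Slot>& t, const std::string& key, int value) {
        std::size_t i = std::hash<std::string>{}(key) % t.size();
        std::size_t reuse = t.size();                        // first Deleted slot passed over, if any
        while (t[i].state != State::Empty) {
            if (t[i].state == State::Occupied && t[i].key == key) { t[i].value = value; return; }
            if (t[i].state == State::Deleted && reuse == t.size()) reuse = i;
            i = (i + 1) % t.size();
        }
        if (reuse == t.size()) reuse = i;                    // otherwise take the Empty slot we hit
        t[reuse] = {State::Occupied, key, value};
    }

    void erase(std::vector<Slot>& t, const std::string& key) {
        std::size_t i = std::hash<std::string>{}(key) % t.size();
        while (t[i].state != State::Empty) {
            if (t[i].state == State::Occupied && t[i].key == key) { t[i].state = State::Deleted; return; }
            i = (i + 1) % t.size();
        }
    }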
Knuth gives a nice refinement as Algorithm R6.4 [pp. 533-534] that instead marks the cell as empty rather than deleted, and then finds ways to move table entries back closer to their initial-probe location by moving the hole that was just made until it ends up next to another hole.
Knuth cautions that this will move existing still-occupied slot entries and is not a good idea if pointers to the slots are being held onto outside of the hash table. [If you have garbage-collected- or other managed-references in the slots, it is all right to move the slot, since it is the reference that is being used outside of the table and it doesn't matter where the slot that references the same object is in the table.]
The Python hash table implementation (arguably very fast) uses dummy elements to mark deletions. As you grow or shrink the table (assuming you're not doing a fixed-size table), you can drop the dummies at the same time.
If you have access to a copy, have a look at the article in Beautiful Code about the implementation.
The best general solutions I can think of include:
If you can use a non-const iterator (a la C++ STL or Java), you should be able to remove them as you encounter them. Presumably, though, you wouldn't be asking this question unless you're using a const iterator or an enumerator which would be invalidated if the underlying collection is modified.
As you said, you could set a deleted flag within the contained object. This doesn't release any memory or reduce collisions on the key, though, so it's not the best solution. It also requires adding a property to the class that probably doesn't really belong there. If this bothers you as much as it would me, or if you simply can't add a flag to the stored object (perhaps you don't control the class), you could store these flags in a separate hash table. This requires the most long-term memory use.
Push the keys of the to-be-removed items into a vector or array list while traversing the hash table. After releasing the enumerator, loop through this secondary list and remove the keys from the hash table (see the sketch after this list). If you have a lot of items to remove and/or the keys are large (which they shouldn't be), this may not be the best solution.
If you're going to end up removing more items from the hash table than you're leaving in there, it may be better to create a new hash table, and as you traverse your original one, add to the new hash table only the items you're going to keep. Then replace your reference(s) to the old hash table with the new one. This saves a secondary list iteration, but it's probably only efficient if the new hash table will have significantly fewer items than the original one, and it definitely only works if you can change all the references to the original hash table, of course.
If your hash table gives you access to its collection of keys, you may be able to iterate through those and remove items from the hash table in one pass.
If your hash table or some helper in your library provides you with predicate-based collection modifiers, you may have a Remove() function to which you can pass a lambda expression or function pointer to identify the items to remove.
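A short C++ sketch of the collect-then-remove option from the list above, using std::unordered_map and a hypothetical "value below threshold" predicate:

    #include <string>
    #include <unordered_map>
    #include <vector>

    // Remove every entry whose value is below `threshold` without invalidating
    // the iterator we are walking with: remember the keys first, erase afterwards.
    void eraseBelow(std::unordered_map<std::string, int>& table, int threshold) {
        std::vector<std::string> doomed;
        for (const auto& [key, value] : table) {
            if (value < threshold) doomed.push_back(key);    // just remember the key
        }
        for (const auto& key : doomed) {
            table.erase(key);                                // safe: no live iterators now
        }
    }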
A common technique when time is a factor is to have a second table of deleted items, and clean up the main table when you have time. Commonly used in search engines.
How about enhancing the hash table to contain pointers like a linked list?
When you insert, if the bucket is full, create a pointer from this bucket to the bucket where the new item is stored.
When deleting something from the hash table, the solution is then equivalent to writing a function that deletes a node from a linked list.
