Chess: Extracting the principal variation from the transposition table - algorithm

Earlier, I was having an issue involving my principal variation becoming truncated by an alpha-beta search. Indeed, this appears to be a common issue. From the authors of Crafty:
Another solution with even worse properties is to extract the full PV
from the transposition table, and avoid using the triangular array
completely. If the transposition table is large enough so that
nothing gets overwritten at all, this would almost work. But there is
a hidden “gotcha”. Once you search the PV, you continue to search
other parts of the tree, and you will frequently encounter some of
those same positions again, but the hash draft will not be deep enough
to allow the entry to be used. You continue searching and now have to
overwrite that entry (exact hash signature match requires this) and
you now leave a transposition “trail” to a different endpoint, one
which does not match the original score. Or any one of the trans/ref
entries between the root and the actual PV endpoint get overwritten by
a different position completely, and you now have no way to search the
transposition table to find the actual endpoint you want to display.
Several use this approach to produce the PV, and the complaints are
generally frequent and loud, because an evaluation paired with an
unrelated PV is not very useful, either for debugging or for analyzing
your own games.
This makes a lot of sense.
Consider that the principal variation is ABCDEF, but AB returns the board to its original position. Then, an alternative line examined later might be CDEFGH, which results in a different evaluation than the earlier search of just CDEF. Thus, the transposition table entry for the board state after AB is overwritten, potentially with a node that will be cut off by alpha-beta (!!), and the PV of ABCDEF is destroyed forever.
Is there any way around this problem, or do I have to use an external data structure to save the PV?
Specifically, what is wrong with replacing if and only if the new entry is deeper and exact? This doesn't seem to work, but I'm not sure why.

Related

Will key in the index be removed after deletion in B Plus tree?

I'm a little confused with the deletion in B+ tree. I searched a lot in Google and find that there are two implementation when the key you want to delete appears in the index:
Delete the key in the index
Keep the key in the index
Algorithm from https://www.javatpoint.com/b-plus-tree-deletion uses the first way.
Algorithm from https://www.cs.princeton.edu/courses/archive/fall08/cos597A/Notes/BplusInsertDelete.pdf uses the second way.
So I really want to know which one is right.
But I'm more inclined to take that as an undefined behavior. At this point, could someone help me figure out the advantage and disadvantage between them? And how to choose between them?
Thanks in advance.
Both methods are correct.
The difference that you highlight is not so much in deleting/not-deleting internal keys, but in updating/not-updating them.
Obviously, when you delete a value (i.e. a key in a leaf node), the b-plus-tree property is not violated: all child values are still within the range dictated by the parent information. You can never break this range-rule by merely removing a value from a leaf. This rule is also still valid when you update the internal key(s) in the path to that leaf (according to method 1), which is only necessary when the deleted value was the left-most one in its leaf.
Note that the two methods may produce quite different trees after a long sequence of the same operations (insert, delete).
But on average the second method will have slightly less work to do. This difference is not significant though.

Logoot CRDT: interleaving of data on concurrent edits to the same spot?

I want to implement Logoot for eventually-convergent P2P text editing and I've run into a bit of a problem.
My understanding of Logoot is that the intervals between objects (lines of text in the original paper, but could be characters or words) can be divided infinitely on account of an unbounded identifier. This means that the position of an object is determined not by its neighbors as in WOOT (which would require tombstones) but by a fixed numerical point along the length of the string. Combined with a unique site identifier, this also gives us a total order and enables eventual convergence.
However... doesn't this cause a problem when concurrent edits are made to the same spot? If two disconnected clients start writing new sentences at the same cursor position and then merge, their sentences have a good chance of interleaving.
Below is a whiteboard example of what I'm talking about:
As you can see, both site B and site C divide the interval between "I" and "conquered" according to the rules of Logoot, giving us random points between the positions of (20,A) and (25,A). But nothing orders these points relative to each other, causing them to mix when merged. Meanwhile, neighbor-based algorithms can account for this issue since the causality chain of each object is preserved.
The above is a baby example, but in the more general case, imagine if two users wanted to insert a different sentence between two existing sentences. If one of the users happened to be offline, they shouldn't come back to a garbled mess! Clearly, to preserve intent, one sentence should follow the other.
Am I missing something in my reading of the paper, or is this an inherent downside to Logoot?
(Also, why is there a recorded clock value that's seemingly unused in the algorithm? The paper even points out that each object's identifier is necessarily unique without the clock.)
You're correct, this a real anomaly in Logoot and LSEQ. Whether or not it constitutes a intention violation depends on what your definition of intention is. An extension to the definition requiring that contiguous sequences remain contiguous unless they are split by a casually subsequent operation would make intuitive sense.
The clock is unnecessary. Most likely the authors used the (site, clock) pair or Lamport timestamp as their UUIDs out of convention. One site can never create two identical positions, so clocks will never need to be compared. (Assuming messages are received from a site in order, which is required for other aspects of Logoot/LSEQ too.)

Fast algorithm for approximate lookup on multiple keys

I have formulated a solution to a problem where I am storing parameters in a set of tables, and I want to be able to look up the parameters based on multiple criteria.
For example, if criteria 1 and criteria 2 can each be either A or B, then I'd have four potential parameters - one for each combination A&A, A&B, B&A and B&B. For these sort of criteria I could concatenate the fields or something similar and create a unique key to look up each value quickly.
Unfortunately not all of my criteria are like this. Some of the criteria are numerical and I only care about whether or not a result sits above or below a boundary. That also wouldn't be a problem on its own - I could maybe use a binary search or something relatively quick to find the nearest key above or below my value.
My problem is I need to include a number of each in the same table. In other words, I could have three criteria - two with A/B entries, and one with less-than-x/greater-than-x type entries, where x is in no way fixed. So in this example I would have a table with 8 entries. I can't just do a binary search for the boundary because the closest boundary won't necessarily be applicable due to the other criteria. For example, if the first two criteria are A&B, then the closest boundary might be 100, but if the if first two criteria are A&A, the closest boundary might be 50. If I want to look up A, A, 101, then I want it to recognise that 50 is the closest boundary that applies - not 100.
I have a procedure to do the lookup but it gets very slow as the tables get bigger - it basically goes through each criteria, checks if a match is still possible, and if so it looks at more criteria - if not, it moves on to check the next entry in the table. So in other words, my procedure requires cycling through the table entries one by one and checking for a match. I have tried to optimise that by ensuring the tables that are input to the procedure are as small as possible and by making sure it looks at the criteria that are least likely to match first (so that it checks each entry as quickly as possible) but it is still very slow.
The biggest tables are maybe 200 rows with about 10 criteria to check, but many are much smaller (maybe 10x5). The issue is that I need to call the procedure many times during my application, so algorithms with some initial overhead don't necessarily make things better. I do have some scope to change the format of the tables before runtime but I would like to keep away from that as much as possible (while recognising it may be the only way forward).
I've done quite a bit of research but I haven't had any luck. Does anyone know of any algorithms that have been designed to tackle this kind of problem? I was really hoping that there would be some clever hash function or something that means I won't have to cycle through the tables, but from my limited knowledge something like that would struggle here. I feel confident that I understand the problem well enough to gradually optimise the solution I have at the moment, but I want to be sure I've not missed a much better solution.
Apologies for the very long and abstract description of the problem - hopefully it's clear what I'm trying to do. I'll amend my question if it's unclear.
Thanks for any help.
this is basically what a query optimizer does in SQL land. There are fast, free, in memory databases for exactly this purpose. Checkout sqlite https://www.sqlite.org/inmemorydb.html.
It sounds like you are doing what is called a 'full table scan' for each query, which is like the last resort for a query optimizer.
As I've understood, you want to select entries by criteria like
A& not B & x1 >= lower_x1 & x1 < upper_x1 & x2 >= lower_x2 & x2 < lower_x2 & ...
The easiest way is to have them sorted by all possible xi, where i=1,2.. in separate sets, and have separated 'words' for various combination of A,B,..
The search will works as follows:
Select a proper world by Boolean criteria combination
For each i, find the population of lower_xi..upper_xi range in corresponding set (this operation is O(log(N))
Select i where the population is the lowest
While iterating instances through lower_xi..upper_xi range, filter the results by checking other upper/lower bound criteria (for all xj where j!=i)
Note that this s a general solution. Of course if you know some relation between your bound(s), you may use a list sorted by respective combination(s) of item values.

Constant-time Spelling Correction on Ten Million Entities

I have a list of ~10M entities. I need to match an entity that a user types out with an entity from the list. Users often misspell the entities (ie. orang instead of orange). I need to correct 1-2 instances of letter replacement (aca instead of aba), letter insertion (aca instead of ac), and letter deletion (aca instead of acca). I want to do this in constant time with respect to the size of the entity list.
Making a dictionary of all possible spellings that are off by 1-2 letters would be constant time but requires an intractably large amount of memory. Edit distance is linear in time with respect to the size of the entity list. I'm thinking there is probably a clever algorithm to prune down the candidate matches to <100 (maybe via a clever hash of the letters in the entity). Then I could run edit distance on the small set of candidates.
Does anyone know of a technique that will work here?
In addition to the linked document in Matt's comment (suggesting to only generate/compare/search via deletions), you can try using a DAWG aka MADFA aka DAFSA to store all possible distance=2 words. For example, for Python there's pyDAWG. Not sure if the space savings will be sufficient to meet your needs, as that depends on language, but if you affixes are similar, it could be quite significant: Each substitution/deletion is just an extra arc, and each insertion is only one more node.

Is there a specific scenario of a hash table that isn't full yet an insertion can't occur?

What I mean to ask is for a hash-table following the standard size of a prime number, is it possible to have some scenario (of inserted keys) where no further insertion of a given element is possible even though there's some empty slots? What kind of hash-function would achieve that?
So, most hash functions allow for collisions ("Hash Collisions" is the phrase you should google to understand this better, by the way.) Collisions are handled by having a secondary data structure, like a list, to store all of the values inserted at keys with the same hash.
Because these data structures can generally store arbitrarily many elements, you will always be able to insert into the hash table, but the performance will get worse and worse, approaching the performance of the backing data structure.
If you do not have a backing data structure, then you can be unable to insert as soon as two things get added to the same position. Since a good hash function distributes things evenly and effectively randomly, this would happen pretty quickly (see "The Birthday Problem").
There are failure-to-insert scenarios for some but not all hash table implementations.
For example, closed hashing aka open addressing implementations use some logic to create a sequence of buckets in which they'll "probe" for values not found at the hashed-to bucket due to collisions. In the real world, sometimes the sequence-creation is pretty basic, for example:
the programmer might have hard-coded N prime numbers, thinking the odds of adding in each of those in turn and still not finding an empty bucket are low (but a malicious user who knows the hash table design may be able to calculate values to make the table fail, or it may simply be so full that the odds are no longer good, or - while emptier - a statistical freak event)
the programmer might have done something like picked a prime number they liked - say 13903 - to add to the last-probed bucket each time until a free one is found, but if the table size happens to be 13903 too it'll keep checking the same bucket.
Still, there are probing approaches such as linear probing that guarantee to try all buckets (unless the implementation goes out of its way to put a limit on retries). It has some other "issues" though, and won't always be the best choice.
If a hash table is implemented using open addressing instead of separate chaining, then it is a good idea to leave at least 1 slot empty to simplify the algorithm.
In open addressing when we are trying to find an element, we first compute the hash index i, then check the table at indexes {i, i + 1, i + 2, ... N - 1, (wrapping around) 0, 1, 2, ...}, until we either find the element we want or hit an empty slot. You can see that in this algorithm, if no slot is empty but the element can't be found, then the search would loop forever.
However, I should emphasize that enforcing merely simplifies the search algorithm. Because alternatively, the search algorithm can remember the starting index i, and halt the search if the entire table has been scanned and it lands back at index i.

Resources