Regex that does NOT match something I matched before - ruby

As part of a question I asked earlier today, my goal is to validate all the moves a rook can make in chess notation.
This consists of:
The letter R
An optional disambiguation, the source of the problem (discussed in detail later)
An optional x to indicate a capture was made
The square to which the rook moved (the columns ["files" in chess] are lettered a-h and the rows ["ranks"] are numbered 1-8)
Disregarding disambiguation, we have the simple
/Rx?[a-h][1-8]/
Disambiguation
It often happens that two rooks can move to a square, and one does. When this happens, a disambiguating letter or number is used. So, if two rooks are on d3 and h5, and the one on h5 moves to d5, it is written Rhd5. Similarly, a rook on d8 moving to d3 when another rook is on d1 is written R8d3.
Files take precedence over ranks. In the first example, if the rook on d3 moved to d5, it could be disambiguated as R3d5 or Rdd5. Only the latter is correct.
The limits on rook disambiguation are:
Any letter may be used for file disambiguation, and
Any number may be used for rank disambiguation, but the number of the square moved to must not be 1 or 8 (R3d1 is not valid because of files' precedence over ranks and should be Rdd1), and it must not be the same number as the number of the square (R3d3 is also invalid)
With the above in mind, I constructed this:
/R([a-h]?x?[a-h][1-8]|([1-8])x?[a-h][2-7&&[^\1]])/
The problem lies in the last characters, [2-7&&[^\1]]. Ruby interprets [^\1] literally, that is as all characters other than \ or 1. If I try putting the \1 outside the brackets ([2-7&&[^]\1]), Ruby complains about the character class with no elements. And if I use an arbitrary placeholder that will never occur, say "z" ([2-7&&[^z]\1]), it doesn't work (I can't explain why)
So how can I use grouping to NOT match what I matched before?

Your question is long and dense, so I will address the core question and let you implement the technique:
How can I use grouping to NOT match what I matched before?
We'll proceed step by step. The following is not an exact chess example, but an illustration of how to accomplish what you want.
Let's say I want a string that matches letters a through h. My regex is ^[a-h]$
Next I want to match a digit and a dash. My regex becomes ^[a-h][0-9]-$
Next I want to match a letter, but not the one we matched before. My regex becomes ^([a-h])[0-9]-(?!\1)[a-h]$, where the ([a-h]) captures the first letter to Group 1, and the negative lookahead (?!\1) asserts that what follows is not the content of what was matched by Group 1 (i.e., it is not that letter).
Let's add a final digit just for balance: ^([a-h])[0-9]-(?!\1)[a-h][0-9]$. This will match a1-b2 but not a1-a2.
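As a rough Ruby sketch of the technique, the first constant below is the illustrative pattern from this answer. The second is only an assumed adaptation to the rook notation in the question: the names ROOK_MOVE and valid_rook_move? are made up for illustration, and the outer group is made non-capturing so that the disambiguating rank becomes group 1.

```ruby
# Illustration from the answer: the second letter must differ from the first.
DIFFERENT_LETTER = /^([a-h])[0-9]-(?!\1)[a-h][0-9]$/

# Assumed adaptation to the rook rules in the question (not the asker's final regex).
# The outer group is non-capturing, so ([1-8]) is group 1, and (?!\1) rejects
# a destination rank equal to the disambiguating rank.
ROOK_MOVE = /\AR(?:[a-h]?x?[a-h][1-8]|([1-8])x?[a-h](?!\1)[2-7])\z/

# Hypothetical helper: true when the move string fits the pattern above.
def valid_rook_move?(move)
  ROOK_MOVE.match?(move)
end

puts DIFFERENT_LETTER.match?("a1-b2") # true
puts DIFFERENT_LETTER.match?("a1-a2") # false
puts valid_rook_move?("R8d3")         # true
puts valid_rook_move?("R3d3")         # false
```

Note that in the question's original regex the rank sits in the second capturing group, so a backreference there would have to be \2 rather than \1.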
Let me know if you have any questions.
Reference
Lookahead and Lookbehind Zero-Length Assertions
Mastering Lookahead and Lookbehind

Move any character of a string to the end: find the minimum number of moves needed to make two strings equal

I was doing some coding practice today and ran into this question, which I couldn't find an approach for. Can anyone share their insight into this problem?
You are given two strings S and T. You can move any character (at any position) in S to the end. Find the minimum number of moves needed to make S and T the same string.
You can assume S and T have the same length and contain the same characters.
Example:
S: cadb
T: abcd
Output:
2
Explanation: 1. Move 'c' to the end, so S becomes "adbc". 2. Move 'd' to the end, so S becomes "abcd", which is the same as T.
Maybe DFS or BFS would help? I don't know...
When I first saw this question I came up with a very naive and rough idea: move every character that is not in the right position, then check; if the new string is still not the same as T, move again, and repeat until they become the same.
When you're done, the characters you don't move will come before all the characters you do move, and they will match a prefix of the target string.
To move the minimum number of characters, find the longest subsequence of S that is a prefix of T. Then move all the other ones in the right order to match the rest of T. If you can't then there is no match possible.
Easy to do -- you just find characters from T in S in order:
T: lookingForThis
S: ThiloFokrinsgo
      ^^ ^^ ^^ ^
Keep: looking
Move: ForThis
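A minimal Ruby sketch of this idea follows. The function name min_moves_to_end is made up for illustration, and returning nil when the two strings do not contain the same characters is an assumption about how to report "no match possible".

```ruby
# Counts how many characters of s must be moved to the end to obtain t.
# Scans s once, greedily matching the next unmatched character of t;
# the matched characters form the longest prefix of t that is a
# subsequence of s, and every other character has to be moved.
def min_moves_to_end(s, t)
  # Assumption: report nil when s and t are not made of the same characters.
  return nil unless s.chars.sort == t.chars.sort

  kept = 0
  s.each_char do |ch|
    kept += 1 if kept < t.length && ch == t[kept]
  end
  s.length - kept
end

puts min_moves_to_end("cadb", "abcd")                     # 2
puts min_moves_to_end("ThiloFokrinsgo", "lookingForThis") # 7
```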

Check if one string includes a substring with Levenshtein distance of 1 from other string

My problem is that we want our users to enter the code like this:
639195-EM-66-XA-53-WX somewhere in the input, so the result may look like this: The code is 639195-EM-66-XA-53-WX, let me in. We still want to match the string if they make a small error in the code (Levenshtein distance of 1). For example: The code is 739195-EM-66-XA-53-WX, let me in. (The first character of the code was changed from 6 to 7.)
The algorithm should match even if user skips dashes, and it should ignore lowercase/uppercase letters. These requirements are easy to fulfil, because I can remove all dashes and do to_uppercase.
Is there an algorithm for something like that?
Generating all strings with the distance of 1 from original code is computationally expensive.
I was also thinking about using something like Levenshtein distance, but ignoring missing letters that user added in the second string, but that would allow wrong letters in the middle of the code.
Searching for the code in user input seems a little bit better, but still not very clean.
I had an idea for a solution, maybe this is good enough for you:
As you said, first remove the dashes and make everything upper (or lower) case:
Sentence: THE CODE IS 639195EM66XA53WX, LET ME IN
Code: 639195EM66XA53WX
Split the code in the middle into c1 and c2. A Levenshtein distance of 1 means there can be at most one mistake (an insertion, deletion, or replacement of a single character), so if the code is present in the sentence with at most one mistake, at least one of c1 and c2 must match exactly. Split in the middle because the longer both substrings are, the fewer spurious matches you should get:
c1: 639195EM
c2: 66XA53WX
Now try to find c1 and c2 in your sentence. If you find a match, go forward (c1 matched) or backwards (c2 matched) in the sentence to check whether the Levenshtein distance of the remaining part is 1 or less.
So in your example you would find c2 and then:
Set pointers to the last character of c1 and the character before the match.
While the characters are the same reduce both pointers by 1 (go backwards in both strings).
If you can consume c1 completely this way you found an exact match (Levenshtein distance of 0).
Otherwise try the 3 possibilities for Levenshtein distance of 1:
Only move the pointer of the c1 backwards and see if the rest matches (deletion).
Only move the pointer of the sentence backwards and see if the rest matches (insertion).
Move both pointers backwards and see if the rest matches (replacement).
If one of them succeeds you found a match with Levenshtein distance of 1, otherwise the distance is higher.
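The steps above can be sketched in Ruby roughly as follows. All method names here are hypothetical, and normalizing by upcasing and deleting dashes follows the question's description. The backward scan for c1 is done by reversing both strings and reusing the forward check, which is an implementation choice rather than something the answer prescribes.

```ruby
# Normalization described in the question: ignore case and dashes.
def normalize(str)
  str.upcase.delete("-")
end

# True if `pattern` is within one edit of some prefix of `text`.
def prefix_within_one_edit?(pattern, text)
  i = 0
  i += 1 while i < pattern.length && i < text.length && pattern[i] == text[i]
  return true if i == pattern.length   # consumed the pattern: exact match

  rest_p = pattern[i..-1]
  rest_t = text[i..-1]
  skip_p = rest_p[1..-1]               # pattern with the mismatching char dropped
  skip_t = rest_t[1..-1] || ""         # text with the mismatching char dropped
  rest_t.start_with?(skip_p) ||        # deletion: move only the pattern pointer
    skip_t.start_with?(rest_p) ||      # insertion: move only the text pointer
    skip_t.start_with?(skip_p)         # replacement: move both pointers
end

# Yields the start index of every occurrence of `needle` in `text`.
def each_index_of(text, needle)
  pos = text.index(needle)
  while pos
    yield pos
    pos = text.index(needle, pos + 1)
  end
end

def contains_code_within_one_edit?(sentence, code)
  text = normalize(sentence)
  code = normalize(code)
  mid = code.length / 2
  c1 = code[0...mid]
  c2 = code[mid..-1]

  # c1 matched exactly: go forward and compare c2 with what follows.
  each_index_of(text, c1) do |pos|
    return true if prefix_within_one_edit?(c2, text[(pos + c1.length)..-1])
  end
  # c2 matched exactly: go backwards and compare c1 with what precedes.
  each_index_of(text, c2) do |pos|
    return true if prefix_within_one_edit?(c1.reverse, text[0...pos].reverse)
  end
  false
end

code = "639195-EM-66-XA-53-WX"
puts contains_code_within_one_edit?("The code is 739195-EM-66-XA-53-WX, let me in.", code) # true
puts contains_code_within_one_edit?("The code is 749195-EM-66-XA-53-WX, let me in.", code) # false
```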

Minimum number of char substitutions to get a palindrome

I would like to solve this problem from TopCoder, in which a string is given and in each step you have to replace all occurrences of a character (of your choice) with another character (of your choice), so that after all steps you get a palindrome. The problem is to find the minimum total number of replacements.
Ideas so far:
I can see that the string after every step is simply a node/vertex in a graph and that the cost of every edge is the number of replacements made in the step, but I don't see how to use a greedy approach for that (it is definitely not the Minimum Spanning Tree problem). I don't think it makes sense to enumerate all possible nodes and edge costs and convert the problem into a Shortest Path problem. On the other hand, I think in every step it makes sense to replace the character X with the biggest number of conflicts with the character Y in conflict with X that occurs most often in the string.
Anyway, I can't prove that it works either. I also can't recognize any known problem in this. Any ideas?
You need to identify disjunct sets of characters. A disjunct set of characters is a set of characters that will all have to become the same character in order for the string to become a palindrome.
Example:
Let's say we have the string abcdefgfmdebac
It has 3 disjunct sets, abc, de and fgm
Algorithm:
Pick the first character and check all occurrences of it, picking up other characters in the set.
In the example string we start with a and pick up b and c (because they sit on the opposite sides of the two a's in our string). We repeat the process for b and c, but no new characters are added to the set. So abc is our first disjunct set.
Continue doing this with the remaining characters.
A disjunct set of n characters (counting all occurrences of its characters) needs n-m replacements, where m is the number of occurrences of the most frequent character.
So simply sum over the sets.
In our example it takes 4 + 2 + 2 = 8 replacements.
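One way to sketch this in Ruby is shown below. Using a small union-find structure to build the sets is a swapped-in implementation choice (the answer describes collecting the sets by repeated scanning), and the function name is made up for illustration.

```ruby
# Minimum number of character replacements to turn str into a palindrome
# when each step replaces all occurrences of one character with another.
def min_replacements_for_palindrome(str)
  chars = str.chars

  # Union-find over characters: mirrored positions must end up equal,
  # so their characters belong to the same set.
  parent = {}
  chars.each { |ch| parent[ch] = ch }
  find = lambda do |ch|
    ch = parent[ch] until parent[ch] == ch
    ch
  end

  (0...chars.length / 2).each do |i|
    a = find.call(chars[i])
    b = find.call(chars[-1 - i])
    parent[a] = b unless a == b
  end

  # Occurrence count of every character.
  counts = Hash.new(0)
  chars.each { |ch| counts[ch] += 1 }

  # For each set: n - m, where n is the total number of occurrences in the
  # set and m is the count of its most frequent character.
  counts.keys.group_by { |ch| find.call(ch) }.values.sum do |letters|
    sizes = letters.map { |ch| counts[ch] }
    sizes.sum - sizes.max
  end
end

puts min_replacements_for_palindrome("abcdefgfmdebac") # 8
```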

Make palindrome from given word

I am given a word like abca. I want to know how many letters I need to add to make it a palindrome.
In this case it's 1, because if I add b, I get abcba.
First, let's consider an inefficient recursive solution:
Suppose the string is of the form aSb, where a and b are letters and S is a substring.
If a==b, then f(aSb) = f(S).
If a!=b, then you need to add a letter: either add an a at the end, or add a b in the front. We need to try both and see which is better. So in this case, f(aSb) = 1 + min(f(aS), f(Sb)).
This can be implemented with a recursive function which will take exponential time to run.
To improve performance, note that this function will only be called with substrings of the original string. There are only O(n^2) such substrings. So by memoizing the results of this function, we reduce the time taken to O(n^2), at the cost of O(n^2) space.
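A minimal memoized version of this recurrence in Ruby might look as follows. The function name and the use of index pairs (lo, hi) instead of actual substrings are illustrative choices.

```ruby
# Minimum number of letters to insert into str to make it a palindrome,
# using f(aSb) = f(S) if a == b, else 1 + min(f(aS), f(Sb)),
# memoized on the substring boundaries [lo, hi].
def min_insertions_for_palindrome(str)
  memo = {}
  solve = lambda do |lo, hi|
    return 0 if lo >= hi   # empty or single-character substring

    memo[[lo, hi]] ||=
      if str[lo] == str[hi]
        solve.call(lo + 1, hi - 1)
      else
        1 + [solve.call(lo + 1, hi), solve.call(lo, hi - 1)].min
      end
  end
  solve.call(0, str.length - 1)
end

puts min_insertions_for_palindrome("abca")       # 1
puts min_insertions_for_palindrome("abcdeffeda") # 2
```

The recursion depth grows with the string length, so for very long inputs an iterative table over substring lengths would be the safer variant.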
The basic algorithm would look like this:
Iterate over half of the string and check whether the same character exists at the corresponding position at the other end (i.e., if you have abca, the first character is an a and the string also ends with a).
If they match, then proceed to the next character.
If they don't match, then note that a character needs to be added.
Note that you can only move backwards from the end when the characters match. For example, if the string is abcdeffeda then the outer characters match. We then need to consider bcdeffed. The outer characters don't match, so a b needs to be added. But we don't want to continue with cdeffe (i.e., removing/ignoring both outer characters); we simply remove b and continue by looking at cdeffed. The same happens for c, which means our algorithm returns 2 string modifications and not more.

Solve Hangman in AI way

I call it the "AI way" because I want to write an application that plays Hangman without any human interaction.
The scenario is like this:
There is a word list containing hundreds of thousands of English words.
The application picks a certain number of words from the list, e.g. 20.
The application plays Hangman against each word until it either wins or fails.
The restriction is the maximum number of wrong guesses.
Allowing 26 obviously makes no sense, so let's say the maximum is 6 wrong guesses.
I tried the strategy mentioned on the wiki page, but it does not work well.
The success rate is only about 30%.
Any suggestions or comments on strategy, or on which field I should dig into to find a reasonably good strategy?
Thanks a lot.
-Simon
PS: A JavaScript implementation that looks fairly good.
(https://github.com/freizl/play-hangman-game)
Updated Idea
Download a dictionary of words and put it into some database or structure of your choice
When presented with a word, narrow your guesses to words of the same length and perform a letter frequency distribution (you can use a dictionary and/or list collection for fast distribution analysis and sorting)
Pick the most common letter from this list
If the letter is found, create a regex pattern based on the known letter(s) and the word length and repeat from step 2
You should be able to quickly narrow down a single word resulting from your pattern search
For posterity:
Take a look at this wiki page. It includes a table of frequencies of the first letters of words which may help you tune your algorithm.
You could also take into account that once you find a vowel or two in a word, the likelihood of finding other vowels decreases significantly, so you should then try more common consonants instead. The example from the wiki page you listed starts with E, then T, and then tries three vowels in a row: A, O and I. The first two letters miss, but once the third letter is found (twice), the process should switch to common consonants and skip trying more vowels, since there will likely be fewer.
Any useful strategy will certainly employ letter frequency charts, and possibly word frequency too: some words are very common while others are rarely used, so computing the letter frequency distribution over a set of more common words might help. That depends on your word selection algorithm, though, which might not take "common" usage into account.
You could also build specialized letter frequency tables, possibly even on the fly. For example, take the Wikipedia hangman example: you find the letter A twice in the word, in the 2nd and 6th positions. You know that the word has seven letters, and with a fairly simple regex you could isolate the words from a dictionary that match this pattern:
_ a _ _ _ a _
Then perform a letter frequency on that set of words that matches this pattern and use that set for your next guess. Rinse and repeat. I think doing some of those things I mentioned but especially the last will really increase your odds of success.
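The pattern filtering and frequency count described above could look roughly like this in Ruby. The function names and the tiny word list are made up for illustration, and excluding already guessed letters from the blank positions is an added assumption (a guessed letter would otherwise have been revealed there).

```ruby
# Builds a regex from a pattern such as "_a___a_". Blanks may be any
# letter that has not been guessed yet.
def pattern_to_regex(pattern, guessed)
  blank = guessed.empty? ? "[a-z]" : "[^#{guessed.join}]"
  Regexp.new("\\A" + pattern.gsub("_", blank) + "\\z")
end

# Words from the list that still fit the revealed pattern.
def matching_words(pattern, words, guessed)
  regex = pattern_to_regex(pattern, guessed)
  words.select { |w| regex.match?(w) }
end

# Most frequent not-yet-guessed letter among the matching words.
def next_guess(pattern, words, guessed)
  freq = Hash.new(0)
  matching_words(pattern, words, guessed).each do |w|
    w.each_char { |ch| freq[ch] += 1 unless guessed.include?(ch) }
  end
  best = freq.max_by { |_, count| count }
  best && best.first
end

# Hypothetical mini dictionary, for illustration only.
words = %w[hangman bagpipe cabinet sandbag tapioca]
p matching_words("_a___a_", words, ["a"]) # ["hangman", "sandbag"]
p next_guess("_a___a_", words, ["a"])     # "n"
```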
The strategies in the linked page seem to be "order guesses by letter frequency" and "guess the vowels, then order guesses by letter frequency"
A couple of observations about hangman:
1) Since guessing a letter that isn't in the word hurts us, we should guess letters by word frequency (percentage of words that contain letter X), not letter frequency (number of times that X appears in all words). This should minimise our chances of guessing a bad letter.
2) Once we've guessed some letters correctly, we know more about the word we're trying to guess.
Here are two strategies that should beat the letter frequency strategy. I'm going to assume we have a dictionary of words that might come up.
If we expect the word to be in our dictionary:
1) We know the length of the target word, n. Remove all words in the dictionary that aren't of length n
2) Calculate the word frequency of all letters in the dictionary
3) Guess the most frequent letter that we haven't already guessed.
4) If we guessed correctly, remove all words from the dictionary that don't match the revealed letters.
5) If we guessed incorrectly, remove all words that contain the incorrectly guessed letter
6) Go to step 2
For maximum effect, instead of calculating word frequencies of all letters in step 2, calculate the word frequencies of all letters in positions that are still blank in the target word.
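Steps 1 to 6 can be sketched in Ruby as below. This is only an illustration under assumptions: the function names, the alphabetical tie-breaking, and the shape of the returned hash are all made up, and the sketch implements the plain word frequency of step 2 rather than the positional refinement mentioned last.

```ruby
# Shows the secret with unguessed letters as "_".
def reveal(secret, guessed)
  secret.chars.map { |ch| guessed.include?(ch) ? ch : "_" }.join
end

# Steps 4 and 5: a word survives if it agrees with every revealed letter
# and has no guessed letter (right or wrong) in a blank position.
def consistent?(word, pattern, guessed)
  return false unless word.length == pattern.length

  word.chars.zip(pattern.chars).all? do |letter, shown|
    shown == "_" ? !guessed.include?(letter) : letter == shown
  end
end

# Steps 2 and 3: the unguessed letter contained in the most candidate words.
# Ties are broken alphabetically so that the result is deterministic.
def best_letter(words, guessed)
  counts = Hash.new(0)
  words.each { |w| (w.chars.uniq - guessed).each { |ch| counts[ch] += 1 } }
  best = counts.max_by { |ch, n| [n, -ch.ord] }
  best && best.first
end

def play(secret, dictionary, max_wrong = 6)
  guessed = []
  wrong = 0
  candidates = dictionary.select { |w| w.length == secret.length } # step 1
  pattern = reveal(secret, guessed)
  while pattern.include?("_") && wrong < max_wrong
    letter = best_letter(candidates, guessed)
    break if letter.nil? # no candidates left: the word is not in the dictionary

    guessed << letter
    wrong += 1 unless secret.include?(letter)
    pattern = reveal(secret, guessed)
    candidates = candidates.select { |w| consistent?(w, pattern, guessed) }
  end
  { solved: !pattern.include?("_"), wrong: wrong, guesses: guessed }
end

# Hypothetical mini dictionary, for illustration only.
dictionary = %w[hangman sandbag bagpipe cabinet tapioca rhythm]
p play("hangman", dictionary)
```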
If we don't expect the word to be in our dictionary:
1) From the dictionary, build up a table of n-grams for some value of n (say 2). If you haven't come across n-grams before, they are groups of consecutive letters inside the word. For example, if the word is "word", the 2-grams are {^w,wo,or,rd,d$}, where ^ and $ mark the start and the end of the word. Count the word frequency of these 2-grams.
2) Start by guessing single letters by word frequency as above
3) Once we've had some hits, we can use the table of word frequency of n-grams to determine either letters to eliminate from our guesses, or letters that we're likely to be able to guess. There are a lot of ways you could achieve this:
For example, you could use 2-grams to determine that the blank in w_rd is probably not z. Or, you could determine that the character at the end of the word ___e_ might (say) be d or s.
Alternatively you could use the n-grams to generate the list of possible characters (though this might be expensive for long words). Remember that you can always cross off all n-grams that contain letters you've guessed that aren't in the target word.
Remember that at each step you're trying not to make a wrong guess, since that keeps us alive. If the n-grams tell you that one position is likely to only be (say) a,b or c, and your word frequency table tells you that a appears in 30% of words, but b and c only appear in 10%, then guess a.
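As a small Ruby illustration of the 2-gram table, the sketch below counts in how many words each 2-gram occurs and uses the table to list the letters that could plausibly fill one blank. The function names, the sample words, and the rule "both neighbouring 2-grams must have been seen at least once" are assumptions made for this example.

```ruby
# 2-grams of a word, with ^ and $ marking the start and end.
def bigrams(word)
  padded = "^#{word}$"
  (0...padded.length - 1).map { |i| padded[i, 2] }
end

# Word frequency of each 2-gram: the number of words that contain it.
def bigram_word_frequency(words)
  table = Hash.new(0)
  words.each { |w| bigrams(w).uniq.each { |gram| table[gram] += 1 } }
  table
end

# Letters that can fill the blank at `index` such that the 2-grams formed
# with its known neighbours were seen in the table. Blank neighbours
# give no information and are skipped.
def plausible_letters(pattern, index, table)
  padded = "^#{pattern}$"
  left = padded[index]       # neighbour on the left (or ^)
  right = padded[index + 2]  # neighbour on the right (or $)
  ("a".."z").select do |letter|
    (left == "_" || table[left + letter] > 0) &&
      (right == "_" || table[letter + right] > 0)
  end
end

# Hypothetical training words, for illustration only.
table = bigram_word_frequency(%w[word ward wore cord])
p bigrams("word")                       # ["^w", "wo", "or", "rd", "d$"]
p plausible_letters("w_rd", 1, table)   # ["a", "o"]
```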
For maximum benefit, you could combine the two strategies.
The strategy discussed is one suitable for humans to implement. Since you're writing an AI you can throw computing power at it to get a better result.
Take your word list, filter it down to only those words which match what information you have about the target word. (At the start that will only be the word length.) For each letter A through Z note how many words contain at least one of them (this is different than the count of the letters.) Pick the letter with the highest score.
You MIGHT even be able to run multiple cycles of this in computing a guess but that might prove too much even for modern CPUs.
Clarification: I'm saying that you might be able to run a look-ahead. If we choose "A" at this level what options does that present for the next level? This is an O(x^n) algorithm, obviously you can't go too far down that path.
