Boyer Moore Algorithm Understanding and Example? - algorithm

I am having trouble understanding the Boyer-Moore string search algorithm.
I am following this document: Link
I cannot work out what delta1 and delta2 really mean here, or how they are applied in the string search.
The language looked a little vague.
If anybody can help me understand this, it would be really helpful.
Or, if you know of any other link or document that is easier to understand, please share it.
Thanks in advance.

The insight behind Boyer-Moore is that if you start searching for a pattern in a string starting with the last character in the pattern, you can jump your search forward multiple characters when you hit a mismatch.
Let's say our pattern p is the sequence of characters p1, p2, ..., pn and we are searching a string s, currently with p aligned so that pn is at index i in s.
E.g.:
s = WHICH FINALLY HALTS.  AT THAT POINT...
p = AT THAT
i =       ^
The B-M paper makes the following observations:
(1) If we try matching a character that is not in p, then we can jump forward n characters.
'F' is not in p, hence we advance n = 7 characters:
s = WHICH FINALLY HALTS.  AT THAT POINT...
p =        AT THAT
i =              ^
(2) If we try matching a character whose last position is k from the end of p, then we can jump forward k characters.
The last position of ' ' in p is 4 from the end, hence we advance 4 characters:
s = WHICH FINALLY HALTS.  AT THAT POINT...
p =            AT THAT
i =                  ^
Now we scan backwards from i until we either succeed or hit a mismatch.
(3a) If the mismatch occurs k characters from the start of p and the mismatched character is not in p, then we can advance (at least) k characters.
'L' is not in p and the mismatch occurred against p6, hence we can advance (at least) 6 characters:
s = WHICH FINALLY HALTS.  AT THAT POINT...
p =                  AT THAT
i =                        ^
However, we can actually do better than this.
(3b) We know that at the old i we had already matched some characters (1 in this case). If the matched characters don't match the start of p, then we can jump forward a little more (this extra distance is called 'delta2' in the paper):
s = WHICH FINALLY HALTS.  AT THAT POINT...
p =                   AT THAT
i =                         ^
At this point, observation (2) applies again, giving
s = WHICH FINALLY HALTS.  AT THAT POINT...
p =                       AT THAT
i =                             ^
and bingo! We're done.
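Observations (1) and (2) can be sketched in Ruby roughly as follows. The table maps each character to the distance from the end of p to its last occurrence, and falls back to n for characters that are not in p. The search loop is the Horspool simplification of Boyer-Moore, which uses only this one kind of table and leaves out delta2, so it is not the paper's full algorithm. The names delta1_table and horspool_index are made up for this illustration.

```ruby
# Distance from the end of the pattern to the last occurrence of each character.
# Characters that never occur get the default shift n (the pattern length).
def delta1_table(pattern)
  n = pattern.length
  table = Hash.new(n)
  pattern.each_char.with_index { |ch, idx| table[ch] = n - 1 - idx }
  table
end

# Horspool-style search (a simplification, not full Boyer-Moore).
# Returns the index of the first match in text, or nil if there is none.
def horspool_index(text, pattern)
  n = pattern.length
  return 0 if n.zero?

  # Same idea as delta1_table, but the final character of the pattern is
  # left out so that every shift is at least 1.
  shift = Hash.new(n)
  pattern[0...-1].each_char.with_index { |ch, idx| shift[ch] = n - 1 - idx }

  i = n - 1 # index in text aligned with the last character of the pattern
  while i < text.length
    k = 0 # number of characters matched so far, scanning backwards from i
    k += 1 while k < n && text[i - k] == pattern[n - 1 - k]
    return i - n + 1 if k == n

    i += shift[text[i]]
  end
  nil
end

text = "WHICH FINALLY HALTS.  AT THAT POINT..."
p delta1_table("AT THAT")         # A: 1, T: 0, space: 4, H: 2
p horspool_index(text, "AT THAT") # 22
```

On the example string this skips ahead by 7 at 'F' and by 4 at the first ' ', just as in steps (1) and (2); after that it takes smaller jumps than the delta2 step would allow, but it still lands on the match at index 22.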

The algorithm is based on a simple principle. Suppose that I'm trying to match a substring of length m. I first look at the character at index m. If that character is not in my substring, I know that the substring I want can't start at any of the indices 1, 2, ... , m.
If that character is in my substring, I'll assume that it is at the last place in my substring that it can be. I'll then jump back and start trying to match my substring from that possible starting place. This piece of information is my first table.
Once I start matching from the beginning of the substring, when I find a mismatch, I can't just start from scratch. I could be partially through a match starting at a different point. For instance, if I'm trying to match anand in ananand, I successfully match anan, then realize that the following a is not a d. But I've just matched an, so I should jump back to trying to match the third character of my substring. This "if I fail after matching x characters, I could be on the y'th character of a match" information is stored in the second table.
Note that when I fail to match, the second table knows how far along in a match I might be, based on what I just matched. The first table knows how far back I might be, based on the character that I just saw and failed to match. You want to use the more pessimistic of those two pieces of information.
With this in mind the algorithm works like this:
start at beginning of string
start at beginning of match
while not at the end of the string:
    if match_position is 0:
        Jump ahead m characters
        Look at character, jump back based on table 1
        If match the first character:
            advance match position
        advance string position
    else if I match:
        if I reached the end of the match:
            FOUND MATCH - return
        else:
            advance string position and match position.
    else:
        pos1 = table1[ character I failed to match ]
        pos2 = table2[ how far into the match I am ]
        if pos1 < pos2:
            jump back pos1 in string
            set match position at beginning
        else:
            set match position to pos2
FAILED TO MATCH
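One common way to build a table with the meaning of the second table above is the prefix function from Knuth-Morris-Pratt. This is not the delta2 table from the Boyer-Moore paper; it is only a sketch of the "if I fail after matching x characters, I could be y characters into another match" idea, and fallback_table is a name made up for this example.

```ruby
# table[i] = length of the longest proper prefix of pattern[0..i]
# that is also a suffix of pattern[0..i].
def fallback_table(pattern)
  table = Array.new(pattern.length, 0)
  k = 0 # length of the prefix currently being extended
  (1...pattern.length).each do |i|
    # On a mismatch, fall back to the next shorter candidate prefix.
    k = table[k - 1] while k > 0 && pattern[i] != pattern[k]
    k += 1 if pattern[i] == pattern[k]
    table[i] = k
  end
  table
end

p fallback_table("anand") # [0, 0, 1, 2, 0]
```

Here table[3] is 2: after matching anan (4 characters), a failure still leaves me 2 characters into another possible match, so matching resumes at the third character of anand, as in the ananand example.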

Related

Change specific index of string, padding if necessary

I have a string called indicators, that the original developer of this application used to store single characters to indicate certain components of a model. I need to change the 7th character in the string, which I tried to do with the following code:
indicators[6] = "R"
The problem, I discovered quickly, was that the string is not always 7 characters long. For example, I have one set of values with U 2, that I need to convert to U 2 R (adding an additional space after the 2). Is there an easy way to force character count with Ruby?
Use String#ljust(integer, padstr=' '):
If integer is greater than the length of [the receiver], returns a new String of
length integer with [the return value] left justified and padded with padstr;
otherwise, returns [an unmodified version of the receiver].
indicators = indicators.ljust(7)
indicators[6] = "R"
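The two lines above could be wrapped in a small helper along these lines. The name set_char_at is hypothetical, and the sample values are only guesses at what the indicator strings look like.

```ruby
# Returns a copy of str with the character at index replaced by char,
# padding with spaces first when str is too short.
def set_char_at(str, index, char)
  padded = str.ljust(index + 1) # ljust returns a new String
  padded[index] = char
  padded
end

puts set_char_at("U 2", 6, "R")       # "U 2   R"
puts set_char_at("ABCDEFGHI", 6, "R") # "ABCDEFRHI"
```

Because ljust returns a new String, the original indicators value is left untouched unless you assign the result back to it.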

String of words - DP

I have a list of words and I must determine the longest "substring" (chain of words) such that the last 2 letters of each word are the first 2 letters of the word after it.
For example, for the words:
star, artifact, book, ctenophore, list, reply
Edit: So the longest substring would be star, artifact, ctenophore, reply
I'm looking for an idea to solve this problem in O(n). No code needed; I'd appreciate any suggestions on how to solve it.
The closest thing to O(n) I have is this :
You should mark every word with an Id. Let's take your example :
star => 1st substring possible. Since you're looking for the longest substring, a substring that starts with ar is not the longest, because you can add star in front of it.
Let's set the ID of star to 1, and its comparison string to ar
artifact => the first two characters match the first possible substring. Let's set the ID of artifact to 1 as well, and change the comparison string to ct
book => the first two characters don't match any of the comparison strings (there's only ct there), so we set the ID of book to 2 and add a new comparison string: ok
...
list => the first two characters don't match any of the comparison strings (re from ID == 1 and ok from ID == 2), so we create another ID = 3 and another comparison string
In the end, you just need to go through the IDs and see which one has the most elements. You can probably count it as you go as well.
The main idea of this algorithm is to remember every substring we are still building. If we find a match, we update the matching substring with the two new last characters; if we don't, we add the word to the "memory list".
Repeating this procedure makes it O(n*m), where m is the number of different IDs.
First, read in all words into a structure. (You don't really need to, but it's easier to work that way. You could also read them in as you go.)
The idea is to have a lookup table (such as a Dictionary in .NET) with one entry for each pair of last two letters seen so far; the corresponding value is always the longest 'substring' found so far that ends with those two letters.
Time complexity is O(n) - you only go through the list once.
Logic:
maxWord <- ""
word <- read next word
initial <- get first two letters of word
end <- get last two letters of word
if lookup contains key initial //that is the longest string so far... add to it
    newWord <- lookup [initial] value + ", " + word
    if lookup doesn't contain key end //nothing ends with these two letters so far
        lookup add (end, newWord) pair
    else if lookup [end] value length < newWord length //if this will be the longest string ending in these two letters, we replace the previous one
        lookup [end] <- newWord
    if maxWord length < newWord length //put this code here so you don't have to run through the lookup table again and find it when you finish
        maxWord <- newWord
else //lookup doesn't contain initial, we use only the word, and similar to above, check if it's the longest that ends with these two letters
    if lookup doesn't contain key end
        lookup add (end, word) pair
    else if lookup [end] value length < word length
        lookup [end] <- word
    if maxWord length < word length
        maxWord <- word
The maxWord variable will contain the longest string.
Here is the actual working code in C#, if you want it: http://pastebin.com/7wzdW9Es
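For comparison, here is a minimal Ruby sketch of the same lookup-table idea. It is not a port of the linked C# code. It assumes "longest" means the most words, so it stores each chain as an array and compares word counts rather than string lengths. The name longest_chain is made up for this example.

```ruby
# best_ending maps the last two letters of a word to the longest chain
# (array of words) found so far that ends with those two letters.
def longest_chain(words)
  best_ending = {}
  longest = []
  words.each do |word|
    next if word.length < 2 # assumption: shorter words cannot join a chain

    # Extend the best chain ending in this word's first two letters, if any.
    chain = (best_ending[word[0, 2]] || []) + [word]
    key = word[-2, 2]
    if best_ending[key].nil? || best_ending[key].length < chain.length
      best_ending[key] = chain
    end
    longest = chain if chain.length > longest.length
  end
  longest
end

p longest_chain(%w[star artifact book ctenophore list reply])
# ["star", "artifact", "ctenophore", "reply"]
```

The list is still walked only once, but copying the arrays adds extra work per word; storing a chain length plus a back-pointer would avoid that at the cost of a slightly longer example.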

How does this gsub and regex work?

I'm trying to learn Ruby and having a hard time figuring out what each individual part of this code is doing. Specifically, how does the global substitution determine whether two consecutive digits are both one of these values [13579], and how does it add a dash (-) in between them?
def DashInsert(num)
  num_str = num.to_s
  num_str.gsub(/([13579])(?=[13579])/, '\1-')
end
num_str.gsub(/([13579])(?=[13579])/, '\1-')
() is called a capturing group; it captures the characters matched by the pattern inside it. Here that pattern is [13579], which matches a single digit from the given set of digits. The matched digit is captured and stored in group 1.
(?=[13579]) is a positive lookahead, which asserts that the match must be followed by text matching the pattern inside the lookahead, without consuming that text. Replacement occurs only if this condition is satisfied.
\1 refers to the characters captured by group 1.
Example:
> "13".gsub(/([13579])(?=[13579])/, '\1-')
=> "1-3"
You may start with some random tests:
def DashInsert(num)
  num_str = num.to_s
  num_str.gsub(/([13579])(?=[13579])/, '\1-')
end
10.times{
  x = rand(10000)
  puts "%6i: %6s" % [x,DashInsert(x)]
}
Example:
9633: 963-3
7774: 7-7-74
6826: 6826
7386: 7-386
2145: 2145
7806: 7806
9499: 949-9
4117: 41-1-7
4920: 4920
14: 14
And now to check the regex.
([13579]) Take any odd digit and remember it (it can be used later with \1).
(?=[13579]) Check whether the next digit is also odd, but don't take it (it still remains in the string).
'\1-' Output the first odd digit and add a - to it.
In other words:
It puts a - between every two adjacent odd digits.
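To see why the lookahead matters, it may help to compare it with a variant that captures the second digit instead. The second pattern below is only for contrast and is not part of the original code.

```ruby
# With a lookahead the second digit is not consumed, so it can start the next match.
with_lookahead = "1357".gsub(/([13579])(?=[13579])/, '\1-')

# With a second capturing group both digits are consumed, so pairs cannot overlap.
without_lookahead = "1357".gsub(/([13579])([13579])/, '\1-\2')

puts with_lookahead    # 1-3-5-7
puts without_lookahead # 1-35-7
```

In the second version, 3 is consumed as part of the match "13", so the scan resumes at 5 and no dash is placed between 3 and 5.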

What does the /^ symbol mean in Ruby?

A friend of mine is trying to explain to me the answer to this problem:
Define a method binary_multiple_of_4?(s) that takes a string and returns true if the string represents a binary number that is a multiple of 4.
However, the example he gave is this:
if (s) == "0"
  return true
end
if /^[01]*(00)$/.match(s) #|| /^0$/.match(s)
  return true
else
  return false
end
It works, because the software we use says there were no errors, but I don't understand why, or what /^ means, and how it's used.
If you could also explain the /^0$/.match(s), that would be great too.
Thanks!
What he is doing is using regular expressions, see: http://www.tutorialspoint.com/ruby/ruby_regular_expressions.htm
To break it down: the pattern to match goes between the slashes, /pattern/, and every character means something. ^ means start of the line, [01] means match a 0 or a 1, * means match the previous thing ([01]) zero or more times, (00) means match 00, and $ means match the end of the line. A binary number with at least two digits is a multiple of 4 exactly when it ends in 00, which is what this pattern checks; the single string "0" is handled by the first if.
If you want to know what /^0$/ matches, you should definitely try to figure it out based on the information in my post or the link I provided. Here's the answer though:
It matches the beginning of the line, a zero, then the end of the line.
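Putting this together, the whole method could be sketched as below. This is a variation on the friend's code rather than the same thing: it uses \A and \z, which anchor to the start and end of the whole string, where ^ and $ anchor to lines, so a multi-line string cannot slip through.

```ruby
def binary_multiple_of_4?(s)
  return true if s == "0"

  # Only 0s and 1s, ending in 00: the last two binary digits are both zero.
  /\A[01]*00\z/.match?(s)
end

puts binary_multiple_of_4?("100")      # true  (4)
puts binary_multiple_of_4?("1010")     # false (10)
puts binary_multiple_of_4?("abc\n100") # false
```

With the original /^[01]*(00)$/, the last call would return true, because the second line on its own matches the pattern.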

Regular expression to match my pattern of words, wild chars

Can you help me with this:
I want a regular expression for my Ruby program to match a word with the pattern below.
The pattern has:
A list of letters (for example, ABCC => 1 A, 1 B, 2 C)
N wildcard characters (N can be 0, 1, or 2)
A fixed word (for example “XY”).
Rules:
Regarding the List of letters, it should match words with
a. 0 or 1 A
b. 0 or 1 B
c. 0 or 1 or 2 C
Based on the value of N, there can be 0 or 1 or 2 wild chars
Fixed word is always in the order it is given.
The combination of all these can be in any order and should match words like below
ABWXY ( if wild char = 1)
BAXY
CXYCB
But not words with 2 A’s or 2 B’s
I am using a pattern like ^[ABCC]*.XY$
But it matches words with more than 1 A, more than 1 B, or more than 2 C's, and it only matches words which end with XY. I want all words which have XY in any place, with the letters and wild chars in any position.
If it HAS to be a regex, the following could be used:
if subject =~
  /^                            # start of string
  (?!(?:[^A]*A){2})             # assert that there are less than two As
  (?!(?:[^B]*B){2})             # and less than two Bs
  (?!(?:[^C]*C){3})             # and less than three Cs
  (?!(?:[ABCXY]*[^ABCXY]){3})   # and less than three non-ABCXY characters
  (?=.*XY)                      # and that XY is contained in the string.
  /x
  # Successful match
else
  # Match attempt failed
end
This assumes that none of the characters A, B, C, X, or Y are allowed as wildcards.
I consider myself to be fairly good with regular expressions and I can't think of a way to do what you're asking. Regular expressions look for patterns, and what you seem to want is quite a few different patterns. It might be more appropriate in your case to write a function which splits the string into characters and counts what you have, so you can check your criteria.
Just to give an example of your problem, a regex like /[abc]/ will match every single occurrence of a, b and c regardless of how many times those letters appear in the string. You can try /c{1,2}/ and it will match "c", "cc", and "ccc". It matches the last case because you have a pattern of 1 c and 2 c's in "ccc".
One thing I have found invaluable when developing and debugging regular expressions is rubular.com. Try some examples and I think you'll see what you're up against.
I don't know if this is really any help but it might help you choose a direction.
You need to break out your pattern properly. In regexp terms, [ABCC] means "any one of A, B or C" where the duplicate C is ignored. It's a set operator, not a grouping operator like () is.
What you seem to be describing is creating a regexp based on parameters. You can do this by passing a string to Regexp.new and using the result.
An example is roughly:
def match_for_options(options)
  pattern = '^'
  pattern << 'A' * options[:a] if (options[:a])
  pattern << 'B' * options[:b] if (options[:b])
  pattern << 'C' * options[:c] if (options[:c])
  Regexp.new(pattern)
end
You'd use it something like this:
if (match_for_options(:a => 1, :c => 2).match('ACC'))
  # ...
end
Since you want to allow these "elements" to appear in any order, you might be better off writing a bit of Ruby code that goes through the string from beginning to end and counts the number of As, Bs, and Cs, finds whether it contains your desired substring. If the number of As, Bs, and Cs, is in your desired limits, and it contains the desired substring, and its length (i.e. the number of characters) is equal to the length of the desired substring, plus # of As, plus # of Bs, plus # of Cs, plus at most N characters more than that, then the string is good, otherwise it is bad. Actually, to be careful, you should first search for your desired substring and then remove it from the original string, then count # of As, Bs, and Cs, because otherwise you may unintentionally count the As, Bs, and Cs that appear in your desired string, if there are any there.
You can do what you want with a regular expression, but it would be a long ugly regular expression. Why? Because you would need a separate "case" in the regular expression for each of the possible orders of the elements. For example, the regular expression "^ABC..XY$" will match any string beginning with "ABC" and ending with "XY" and having two wild card characters in the middle. But only in that order. If you want a regular expression for all possible orders, you'd need to list all of those orders in the regular expression, e.g. it would begin something like "^(ABC..XY|ACB..XY|BAC..XY|BCA..XY|" and go on from there, with about 5! = 120 different orders for that list of 5 elements, then you'd need more for the cases where there was no A, then more for cases where there was no B, etc. I think a regular expression is the wrong tool for the job here.
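The counting approach described above could be sketched in Ruby like this. The method name, the keyword arguments, and their defaults are made up for this example. It assumes "at most N" wildcard characters, and that any character other than the listed letters counts as a wildcard once the fixed word has been removed.

```ruby
def matches_pattern?(word, limits: { "A" => 1, "B" => 1, "C" => 2 }, wildcards: 1, fixed: "XY")
  start = word.index(fixed)
  return false if start.nil? # the fixed word must appear somewhere

  # Remove one occurrence of the fixed word before counting letters.
  rest = word[0, start] + word[(start + fixed.length)..-1]

  counts = Hash.new(0)
  rest.each_char { |ch| counts[ch] += 1 }

  # Reject words that use a listed letter more often than allowed.
  return false if limits.any? { |letter, max| counts[letter] > max }

  # Everything that is not a listed letter has to be covered by wildcards.
  extras = rest.length - limits.keys.sum { |letter| counts[letter] }
  extras <= wildcards
end

puts matches_pattern?("ABWXY") # true
puts matches_pattern?("CXYCB") # true
puts matches_pattern?("AAXY")  # false
```

Which occurrence of the fixed word gets removed does not change the result, because the check only depends on which characters remain and not on their order.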
