Automata with Kleene star - algorithm

I'm learning about automata. Can you please help me understand how an automaton with Kleene closure works? Let's say I have the letters a, b, c and I need to find text matching a pattern that contains a Kleene star - like ab*bac - how will that work?

The question seems to be more about how an automaton would handle Kleene closure than what Kleene closure means.
With a simple regular expression, e.g., abc, it's pretty straightforward to design an automaton to recognize it. Each state essentially tells you where you are in the expression so far. State 0 means it's seen nothing yet. State 1 means it's seen a. State 2 means it's seen ab. Etc.
The difficulty with Kleene closure is that a pattern like ab*bc introduces ambiguity. Once the automaton has seen the a and is then faced with a b, it doesn't know whether that b is part of the b* or the literal b that follows it, and it won't know until it reads more symbols--maybe many more.
The simplistic answer is that the automaton simply has a state that literally means it doesn't know yet which path was taken.
In simple cases, you can build this automaton directly. In the general case, you usually build something called a non-deterministic finite automaton (NDFA, more often abbreviated NFA). You can either simulate the NDFA, or--if performance is critical--you can apply the subset-construction algorithm, which converts the NDFA to a deterministic one. The algorithm essentially generates all the ambiguous states for you.
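For instance, here is a minimal sketch of simulating the NFA for ab*bc by tracking the set of states you might currently be in - exactly the "doesn't know yet" idea above. The state numbering is my own encoding for illustration, not a standard library API:

# NFA for a b* b c. States: 0 --a--> 1; 1 --b--> 1 (inside b*);
# 1 --b--> 2 (the literal b); 2 --c--> 3 (accepting).
NFA = {
    (0, 'a'): {1},
    (1, 'b'): {1, 2},   # ambiguous: this b may extend b* or be the literal b
    (2, 'c'): {3},
}

def accepts(s):
    states = {0}                      # set of states = "where we might be"
    for ch in s:
        states = set().union(*(NFA.get((q, ch), set()) for q in states))
    return 3 in states

print(accepts('abc'))    # True (zero b's from b*)
print(accepts('abbbc'))  # True
print(accepts('ac'))     # False

The state sets that show up while running this ({1, 2} after the first b, for example) are precisely the combined states the subset-construction algorithm would generate ahead of time.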

The Kleene star ('*') means you can have as many occurrences of the preceding element as you want (0 or more).
a* will match any number of a's.
(ab)* will match any number of repetitions of the string "ab".
If you are trying to match an actual asterisk in an expression, the way you write it depends entirely on the syntax of the regex flavor you are working with. In the general case, the backslash \ is used as an escape character:
\* will match an asterisk.
For recognizing a pattern at the end, use concatenation:
(a U b)*c* will match any string that contains 0 or more 'c's at the end, preceded by any number of a's or b's.
For matching text against a pattern that ends with a Kleene star, again, you can have 0 or more occurrences of the repeated string:
ab(c)* - Possible matches: ab, abc, abcc, abccc, etc.
a(bc)* - Possible matches: a, abc, abcbc, abcbcbc, etc.

Your expression ab*bac in English would read something like:
a followed by 0 or more b's, followed by bac
Strings that would match the regular expression if used for search:
abac
abbbbbbbbbbac
abbac
Strings that would not match:
abaca //added extra literal
bac //missing leading a
As stated in the previous answer, actually searching for a literal * would require an escape character, which is implementation specific and requires knowledge of your language/library of choice.
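A quick way to check the example strings above is to run them through a concrete regex engine; here is Python's re module (the fullmatch anchoring is my choice, to mimic whole-string matching):

import re

rx = re.compile(r'ab*bac')
for s in ['abac', 'abbbbbbbbbbac', 'abbac', 'abaca', 'bac']:
    print(s, bool(rx.fullmatch(s)))
# abac True, abbbbbbbbbbac True, abbac True, abaca False, bac False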

Related

strong good suffix rule for the Boyer-Moore Algorithm

So one of the exercises in my class was to perform the Boyer-Moore algorithm with the pattern abc and the strings aabcbcbabcabcabc and abababababababab. I was also meant to note the number of comparisons.
I did this using the extended bad character rule; however, I was told today that one needs to use the strong good suffix rule once a match is found in the string. I'm a bit confused about how the strong good suffix rule would be used here, as the pattern length is only 3.
When there is a full match, since the pattern abc has no border, would I simply shift by 3 symbols when it is encountered in the string?
Thank You!
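(For what it's worth, the intuition in the question can be checked mechanically: on a full occurrence, the good-suffix shift equals the pattern length minus the length of its longest border. A small Python sketch of my own:)

def full_match_shift(p):
    # Shift after a full match = len(p) - length of the longest border
    # (a border is a proper prefix of p that is also a suffix of p).
    m = len(p)
    for k in range(m - 1, 0, -1):
        if p[:k] == p[-k:]:
            return m - k
    return m   # no border at all: shift by the whole pattern length

print(full_match_shift('abc'))   # 3 - so yes, shift by 3 symbols
print(full_match_shift('abab'))  # 2 - border 'ab' allows a shorter shift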

What algorithms can group characters into words?

I have some text generated by some lousy OCR software.
The output contains a mixture of words and space-separated characters which should have been grouped into words. For example,
Expr e s s i o n Syntax
S u m m a r y o f T e r minology
should have been
Expression Syntax
Summary of Terminology
What algorithms can group characters into words?
If I program in Python, C#, Java, C or C++, what libraries provide implementations of these algorithms?
Thanks.
Minimal approach (sketched in code below):
In your input, remove the space before any single-letter words, and mark the merged words created this way somehow (prefix them with a symbol not in the input, for example).
Get a dictionary of English words, sorted longest to shortest.
For each marked word in your input, find the longest match and break that off as a word. Repeat on the characters left over in the original "word" until there's nothing left over. (In the case where there's no match just leave it alone.)
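A rough Python sketch of this minimal approach (the toy dictionary, and treating every chunk as marked, are assumptions for illustration):

# Toy dictionary, longest first, as the second step requires.
WORDS = sorted(['expression', 'syntax', 'summary', 'terminology', 'of'],
               key=len, reverse=True)

def split_chunk(chunk):
    """Greedily break a glued-together chunk into dictionary words."""
    out = []
    while chunk:
        for w in WORDS:
            if chunk.startswith(w):
                out.append(w)
                chunk = chunk[len(w):]
                break
        else:
            return out + [chunk]   # no match: leave the remainder alone
    return out

print(split_chunk('summaryofterminology'))  # ['summary', 'of', 'terminology']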
More sophisticated, overkill approach:
The problem of splitting words without spaces is a real-world problem in languages commonly written without spaces, such as Chinese and Japanese. I'm familiar with Japanese so I'll mainly speak with reference to that.
Typical approaches use a dictionary and a sequence model. The model is trained to learn transition properties between labels - part of speech tagging, combined with the dictionary, is used to figure out the relative likelihood of different potential places to split words. Then the most likely sequence of splits for a whole sentence is solved for using (for example) the Viterbi algorithm.
Creating a system like this is almost certainly overkill if you're just cleaning OCR data, but if you're interested it may be worth looking into.
A sample case where the more sophisticated approach will work and the simple one won't:
input: Playforthefunofit
simple output: Play forth efunofit (forth is longer than for)
sophisticated output: Play for the fun of it (forth efunofit is a low-frequency - that is, unnatural - transition, while for the is not)
You can work around the issue with the simple approach to some extent by adding common short-word sequences to your dictionary as units. For example, add forthe as a dictionary word, and split it in a post processing step.
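For a taste of the scoring idea without a trained sequence model, here is a dynamic-programming sketch over toy unigram frequencies; the frequency table is an assumption, and a real system would estimate it from a corpus:

import math

# Toy unigram frequencies -- an assumption for illustration only.
FREQ = {'play': 50, 'for': 200, 'forth': 5, 'the': 300,
        'fun': 40, 'of': 250, 'it': 220, 'e': 1}
TOTAL = sum(FREQ.values())

def segment(text):
    """best[i] holds (score, split point) for the best segmentation of text[:i]."""
    n = len(text)
    best = [(0.0, 0)] + [(-math.inf, 0)] * n
    for i in range(1, n + 1):
        for j in range(max(0, i - 12), i):       # 12 = assumed max word length
            w = text[j:i]
            if w in FREQ:
                score = best[j][0] + math.log(FREQ[w] / TOTAL)
                if score > best[i][0]:
                    best[i] = (score, j)
    words, i = [], n                              # backtrack through split points
    while i > 0:
        j = best[i][1]
        words.append(text[j:i])
        i = j
    return ' '.join(reversed(words))

print(segment('playforthefunofit'))  # play for the fun of it

The greedy splitter takes forth because it is longer, but the frequency-weighted path through for the scores higher overall, which is exactly the behavior described above.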
Hope that helps - good luck!

Word Pattern Finder

Problem: find all words that follow a pattern (independently of the actual symbols used to define the pattern).
Almost identical to what this site does: http://design215.com/toolbox/wordpattern.php
Enter patterns like: ABCCDE
This will find words like "bloody", "kitten", and "valley". The above pattern will NOT find words like "fennel" or "hippie" because that would require the pattern to be ABCCBE.
Please note: I need a version of that algorithm that does find words like "fennel" or "hippie" even with an ABCCDE pattern.
To complicate things further, there is the possibility to add known characters anywhere in the searching pattern, for example: cBBX (where c is the known character) will yield cees, coof, cook, cool ...
What I've done so far: I found this answer (Pattern matching for strings independent from symbols) that solves my problem almost perfectly, but if I assign an integer to every word I need to compare, I will encounter two problems.
The first is the number of unique digits I can use. For example, if the pattern is XYZABCDEFG, the equivalent digit pattern will be 1 2 3 4 5 6 7 8 9 and then? 10? Consider that I would use the digit 0 to indicate a known character (for example, aBe --> 010 --> 10). Using hexadecimal digits pushes the problem further away but does not solve it.
The second problem is the maximum length of the pattern: a long in Java is 19 digits long, and I need no length restriction on my patterns (although I don't think there exists a word with 20 different characters).
To solve those problems, I could store each digit of the pattern in an array, but then it becomes an array-to-array comparison instead of an integer comparison, thus taking a lot more time to compute.
As a side note: given the algorithm used, what data structure would be best suited for storing the dictionary? I was thinking about using a hash map, converting each word into its digit-pattern equivalent (assuming no known characters) and using this number as a hash (of course, there would be a lot of collisions). That way, searching would first require matching the numeric pattern, and then scanning the results to find all the words that have the known characters in the right places (if present in the original search pattern).
Also, the dictionary is not static: words can be added and deleted.
EDIT:
This answer (https://stackoverflow.com/a/44604329/4452829) works fairly well and it's fast (testing for equal lengths before matching the patterns). The only problem is that I need a version of that algorithm that finds words like "fennel" or "hippie" even with an ABCCDE pattern.
I've already implemented a way to check for known characters.
EDIT 2:
Ok, by checking whether each character in the pattern is greater than or equal to the corresponding character in the current word (normalized as a temporary pattern), I am almost done: it correctly matches the search pattern ABCA with the word ABBA and correctly ignores the word ABAC. The last remaining problem is that if (for example) the pattern is ABBA, it will match the word ABAA, and that's not correct.
EDIT 3:
Meh, not pretty, but it seems to be working (I'm using Python because it's fast to code with). Also, the search pattern can be any sequence of symbols, using lowercase letters as fixed characters and everything else as wildcards; there is also no need to convert each word into an abstract pattern.
def match(pattern, fixed_chars, word):
    d = dict()
    if len(pattern) != len(word):
        return False
    if check_fixed_char(word, fixed_chars) is False:  # helper mentioned above, not shown
        return False
    for i in range(0, len(pattern)):
        cp = pattern[i]
        cw = word[i]
        if cp in d:
            if d[cp] != cw:
                return False
        else:
            d[cp] = cw
            if cp > cw:
                return False
    return True
A long time ago I wrote a program for solving cryptograms which was based on the same concept (generating word patterns such that "kitten" and "valley" both map to "abccde").
My technique did involve generating a sort of index of words by pattern.
The core abstraction function looks like:
#!/usr/bin/env python
import string

def abstract(word):
    '''find abstract word pattern
    dog or cat -> abc, book or feel -> abbc
    '''
    al = list(string.ascii_lowercase)
    d = dict()
    for i in word:
        if i not in d:
            d[i] = al.pop(0)
    return ''.join([d[i] for i in word])
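A quick check of the function (my own examples, consistent with the docstring):

print(abstract('dog'))     # abc
print(abstract('book'))    # abbc
print(abstract('kitten'))  # abccde
print(abstract('valley'))  # abccde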
From there building our index is pretty easy. Assume we have a file like /usr/share/dict/words (commonly found on Unix-like systems including MacOS X and Linux):
#!/usr/bin/python
words_by_pattern = dict()
words = set()
with open('/usr/share/dict/words') as f:
    for each in f:
        words.add(each.strip().lower())
for each in sorted(words):
    pattern = abstract(each)
    if pattern not in words_by_pattern:
        words_by_pattern[pattern] = list()
    words_by_pattern[pattern].append(each)
... that takes less than two seconds on my laptop for about 234,000 "words" (Although you might want to use a more refined or constrained word list for your application).
Another interesting trick at this point is to find the patterns which are most unique (i.e., return the fewest possible words). We can create a histogram of patterns thus:
histogram = [(len(words_by_pattern[x]), x) for x in words_by_pattern.keys()]
histogram.sort(reverse=True)   # largest counts first
I find that this gives me:
8077 abcdef
7882 abcdefg
6373 abcde
6074 abcdefgh
3835 abcd
3765 abcdefghi
1794 abcdefghij
1175 abc
1159 abccde
925 abccdef
Note that abc, abcd, and abcde are all in the top ten. In other words, the most common letter patterns for words include all of those with no repeated letters, from 3 to 10 characters long.
You can also look at the histogram of the histogram, in other words, how many patterns match only one word: for example, aabca only matches "eerie" and aabcb only matches "llama". There are over 48,000 patterns with only a single matching word, almost six thousand with just two words, and so on.
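That second histogram can be read straight off the index; a two-liner, assuming the words_by_pattern dict built above:

from collections import Counter

sizes = Counter(len(v) for v in words_by_pattern.values())
print(sizes[1], sizes[2])  # patterns with exactly one / exactly two words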
Note: I don't use digits; I use letters to create the pattern mappings.
I don't know if this helps with your project at all, but these are very simple snippets of code. (They're intentionally verbose.)
This can easily be achieved using regular expressions.
For example, the pattern below matches any word that has the ABCCDE pattern:
(?:([A-z])(?!\1)([A-z])(?!\1|\2)([A-z])(?=\3)([A-z])(?!\1|\2|\3|\5)([A-z])(?!\1|\2|\3|\5|\6)([A-z]))
And this one matches ABCCBE:
(?:([A-z])(?!\1)([A-z])(?!\1|\2)([A-z])(?=\3)([A-z])(?=\2)([A-z])(?!\1|\2|\3|\5|\6)([A-z]))
To cover both of the above patterns, you can use:
(?:([A-z])(?!\1)([A-z])(?!\1|\2)([A-z])(?=\3)([A-z])(?(?=\2)|(?!\1\2\3\5))([A-z])(?!\1|\2|\3|\5|\6)([A-z]))
Going down this path, your challenge would be generating the above regex pattern from the alphabetic notation you used.
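As a sketch of that generation step, here is one way to do it in Python (the function name and the lowercase [a-z] character class are my assumptions; adapt to your regex flavor):

import re

def pattern_to_regex(pattern):
    """Build a backreference/lookahead regex from a letter pattern like 'ABCCDE'."""
    groups = {}   # pattern letter -> capture group number
    parts = []
    for ch in pattern:
        if ch in groups:
            parts.append('\\%d' % groups[ch])     # repeated letter: backreference
        else:
            if groups:                            # new letter must differ from earlier ones
                refs = '|'.join('\\%d' % n for n in groups.values())
                parts.append('(?!%s)' % refs)
            groups[ch] = len(groups) + 1
            parts.append('([a-z])')
    return re.compile('^%s$' % ''.join(parts))

rx = pattern_to_regex('ABCCDE')
print(bool(rx.match('valley')))   # True
print(bool(rx.match('fennel')))   # False - its canonical pattern is abccbd

Dropping the (?!…) lookaheads gives the relaxed variant the question asks for, in which distinct pattern letters are also allowed to coincide (so ABCCDE would then find "fennel" and "hippie" too).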
And please note that you may want to use the i Regex flag when using these if case-insensitivity is a requirement.
For more Regex info, take a look at:
Look-around
Back-referencing

Is there a method to find the most specific pattern for a string?

I'm wondering whether there is a way to generate the most specific regular expression (if such a thing exists) that matches a given string. Here's an illustration of what I want the method to do:
str = "(17 + 31)"
find_pattern(str)
# => /^\(\d+ \+ \d+\)$/ (or something more specific)
My intuition was to use Regexp.new to accumulate the desired pattern by looping through str and checking for known patterns like \d, \s, and so on. I suspect there is an easy way of doing this.
This is in essence an algorithmic compression problem. The simplest way to match a list of known strings is to use the Regexp.union factory method, but that just tries each string in turn; it does not do anything "clever":
combined_rx = Regexp.union( "(17 + 31)", "(17 + 45)" )
=> /\(17\ \+\ 31\)|\(17\ \+\ 45\)/
This can still be useful to construct multi-stage validators, without you needing to write loops to check them all.
However, a generic pattern matcher that could figure out what you mean to match from examples is not really possible. There are too many ways in which you could consider strings to be similar or not. The closest I could think of would be genetic programming where you supply a large list of should match/should not match strings and the code guesses at the best regex by constructing random Regexp objects (a challenge in itself) and seeing how accurately they match and don't match your examples. The best matchers could be combined and mutated and tried again until you got 100% accuracy. This might be a fun project, but ultimately much more effort for most purposes than writing the regular expressions yourself from a description of the problem.
If your problem is heavily constrained - e.g. any example integer could always be replaced by \d+, any example space by \s+ etc, then you could work through the string replacing "matchable units", in fact using the same regular expressions checked in turn. E.g. if you match \A\d+ then consume the match from the string, and add \d+ to your regex. Then take the remainder of the string and look for next matching pattern. Working this way will have its limitations (you must know the full set of patterns you want to match in advance, and all examples would have to be unambiguous). However, it is more tractable than a genetic program.
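A sketch of that "matchable units" loop, in Python for brevity (the unit list is an assumption and would need extending for real inputs):

import re

# Each unit pairs a recognizer with the generalized pattern it emits.
UNITS = [
    (re.compile(r'\d+'), r'\d+'),
    (re.compile(r'\s+'), r'\s+'),
    (re.compile(r'[A-Za-z]+'), r'[A-Za-z]+'),
]

def find_pattern(s):
    out, i = ['^'], 0
    while i < len(s):
        for rx, repl in UNITS:
            m = rx.match(s, i)
            if m:                          # consume the unit, emit its pattern
                out.append(repl)
                i = m.end()
                break
        else:
            out.append(re.escape(s[i]))    # anything else stays a literal
            i += 1
    out.append('$')
    return ''.join(out)

print(find_pattern('(17 + 31)'))  # ^\(\d+\s+\+\s+\d+\)$

This produces \s+ where the question's example shows a literal space; which generalization you prefer is exactly the kind of constraint you have to decide in advance.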

Treetop backtracking similar to regex?

Everything I've read suggests Treetop backtracks like regular expressions, but I'm having a hard time making that work.
Suppose I have the following grammar:
grammar TestGrammar
  rule open_close
    '{' .+ '}'
  end
end
This does not match the string {abc}. I suspect that's because the .+ is consuming everything from the letter a onwards. I.e. it's consuming abc} when I only want it to consume abc.
This appears different from what a similar regex does. The regex /{.+}/ will match {abc}. It's my understanding that this is possible because the regex engine backtracks after consuming the closing } as part of the .+ and then failing to match.
So can Treetop do backtracking like that? If so, how?
I know you can use negation to match "anything other than a }." But that's not my intention. Suppose I want to be able to match the string {ab}c}. The tokens I want in that case are the opening {, a middle string of ab}c, and the closing }. This is a contrived example, but it becomes very relevant when working with nested expressions like {a b {c d}}.
Treetop is an implementation of a Parsing Expression Grammar parser. One of the benefits of PEGs is their combination of flexibility, speed, and memory requirements. However, this balancing act has some tradeoffs.
Quoting from the Wikipedia article:
The zero-or-more, one-or-more, and optional operators consume zero or more, one or more, or zero or one consecutive repetitions of their sub-expression e, respectively. Unlike in context-free grammars and regular expressions, however, these operators always behave greedily, consuming as much input as possible and never backtracking. […] the expression (a* a) will always fail because the first part (a*) will never leave any a's for the second part to match.
(Emphasis mine.)
In short: while certain PEG operators can backtrack in an attempt to take another route, the + operator cannot.
Instead, in order to match nested sub-expressions, you want an alternation that tries the delimited sub-expression first and falls back to non-bracket characters. Something like (untested):
grammar TestGrammar
  rule open_close
    '{' contents '}'
  end
  rule contents
    ( open_close / non_brackets )*
  end
  rule non_brackets
    # …
  end
end
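To see why the alternation handles nesting, here is the same idea rendered as a plain recursive-descent sketch in Python (my own illustration, not Treetop itself):

def parse_open_close(s, i=0):
    """Parse '{' contents '}' at s[i]; return (tree, next index) or None."""
    if i >= len(s) or s[i] != '{':
        return None
    items, i = [], i + 1
    while i < len(s) and s[i] != '}':
        sub = parse_open_close(s, i)
        if sub:                        # the nested open_close branch wins first
            node, i = sub
            items.append(node)
        else:                          # otherwise consume one non-bracket character
            items.append(s[i])
            i += 1
    if i < len(s) and s[i] == '}':
        return items, i + 1
    return None

print(parse_open_close('{a b {c d}}'))
# (['a', ' ', 'b', ' ', ['c', ' ', 'd']], 11)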
