Generating correct phrases from PEG grammars - algorithm

I wrote a PEG parser generator just for fun (I will publish it on NPM some time) and thought it would be easy to add a randomised phrase generator on top of it. The idea is to automatically get correct phrases, given a grammar. So I set the following rules to generate strings from each type of parser:
Sequence p1 p2 ... pn : Generate a phrase for each subparser and return the concatenation.
Alternative p1 | p2 | ... | pn : Randomly pick a subparser and generate a phrase with it.
Repetition p{n, m} : Pick a number x in [n, m] (or [n, n+2] if m === Infinity) and return a concatenation of x generated phrases from p.
Terminal : Just return the terminal literal.
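As a rough illustration of these four rules, here is a minimal Python sketch (not the actual generator, which targets Node.js). The combinator names `terminal`, `sequence`, `alternative` and `repetition` are made up for this example, and joining the parts with a single space is an assumption of the sketch.

```python
import random

def terminal(literal):
    # Terminal: just return the literal.
    return lambda rng: literal

def sequence(*parsers):
    # Sequence: generate a phrase per subparser and concatenate them.
    def gen(rng):
        parts = [p(rng) for p in parsers]
        return " ".join(part for part in parts if part)
    return gen

def alternative(*parsers):
    # Alternative: randomly pick one subparser and generate with it.
    return lambda rng: rng.choice(parsers)(rng)

def repetition(parser, n, m=float("inf")):
    # Repetition: pick x in [n, m], or in [n, n+2] when m is unbounded.
    def gen(rng):
        upper = n + 2 if m == float("inf") else m
        count = rng.randint(n, upper)
        parts = [parser(rng) for _ in range(count)]
        return " ".join(part for part in parts if part)
    return gen

if __name__ == "__main__":
    rng = random.Random()
    det = alternative(terminal("an"), terminal("my"))
    noun = alternative(terminal("cat"), terminal("dog"))
    print(sequence(det, noun)(rng))
```

Passing the random source in explicitly keeps the sketch reproducible when a seeded `random.Random` is used.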
When I take the following grammar :
S: NP VP
PP: P NP
NP: Det N | Det N PP | 'I'
VP: V NP | VP PP
V: 'shot' | 'killed' | 'wounded'
Det: 'an' | 'my'
N: 'elephant' | 'pajamas' | 'cat' | 'dog'
P: 'in' | 'outside'
It works great. Some examples :
my pajamas killed my elephant
an pajamas wounded my pajamas in my pajamas
an dog in I wounded my cat in I outside my elephant in my elephant in an pajamas outside an cat
I wounded my pajamas in my dog
This grammar has a recursion (PP: P NP > NP: Det N PP). When I take this other recursive grammar, for math expression this time :
expr: term (('+' | '-') term)*
term: fact (('*' | '/') fact)*
fact: '1' | '(' expr ')'
Almost one time in two, I get a "Maximum call stack size exceeded" error (in NodeJS). The other half of the time, I get correct expressions :
( 1 ) * 1 + 1
( ( 1 ) / ( 1 + 1 ) - ( 1 / ( 1 * 1 ) ) / ( 1 / 1 - 1 ) ) * 1
( ( ( 1 ) ) )
1
1 / 1
I guess the recursive production for fact gets called too often and too deep in the call stack, and this makes the whole thing blow up.
How can I make my approach less naive in order to avoid those cases that explode the call stack ? Thank you.

Of course if a grammar describes arbitrarily long inputs, you can easily end up in a very deep recursion. A simple way to avoid this trap is to keep a priority queue of partially expanded sentential forms where the key is length. Remove the shortest and replace each non-terminal in each possible way, emitting those that are now all terminals and adding the rest back onto the queue. You might also want to maintain an "already emitted" set to avoid emitting duplicates. If the grammar doesn't have anything like epsilon productions where a sentential form derives a shorter string, then this method produces all strings described by the grammar in non-decreasing length order. That is, once you've seen an output of length N, all strings of length N-1 and shorter have already appeared.
Since OP asked about details, here's an implementation for the expression grammar. It's simplified by rewriting the PEG as a CFG.
import heapq

def run():
    g = {
        '<expr>': [
            ['<term>'],
            ['<term>', '+', '<expr>'],
            ['<term>', '-', '<expr>'],
        ],
        '<term>': [
            ['<fact>'],
            ['<fact>', '*', '<term>'],
            ['<fact>', '/', '<term>'],
        ],
        '<fact>': [
            ['1'],
            ['(', '<expr>', ')']
        ],
    }
    gen(g)

def is_terminal(s):
    for sym in s:
        if sym.startswith('<'):
            return False
    return True

def gen(g, lim = 10000):
    q = [(1, ['<expr>'])]
    n = 0
    while n < lim:
        _, s = heapq.heappop(q)
        # print("pop: " + ''.join(s))
        a = []
        b = s.copy()
        while b:
            sym = b.pop(0)
            if sym.startswith('<'):
                for rhs in g[sym]:
                    s_new = a.copy()
                    s_new.extend(rhs)
                    s_new.extend(b)
                    if is_terminal(s_new):
                        print(''.join(s_new))
                        n += 1
                    else:
                        # print("push: " + ''.join(s_new))
                        heapq.heappush(q, (len(s_new), s_new))
                break # only generate leftmost derivations
            a.append(sym)

run()
Uncomment the extra print()s to see heap activity. Some example output:
1
(1)
1*1
1/1
1+1
1-1
((1))
(1*1)
(1/1)
(1)*1
(1)+1
(1)-1
(1)/1
(1+1)
(1-1)
1*(1)
1*1*1
1*1/1
1+(1)
1+1*1
1+1/1
1+1+1
1+1-1
1-(1)
1-1*1
1-1/1
1-1+1
1-1-1
1/(1)
1/1*1
1/1/1
1*1+1
1*1-1
1/1+1
1/1-1
(((1)))
((1*1))
((1/1))
((1))*1
((1))+1
((1))-1
((1))/1
((1)*1)
((1)+1)
((1)-1)
((1)/1)
((1+1))
((1-1))
(1)*(1)
(1)*1*1
(1)*1/1
(1)+(1)
(1)+1*1
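The "already emitted" set mentioned in the answer is not part of the listing above. A possible variant, sketched here as a Python generator, adds it and also stops cleanly if the queue runs dry. The names `generate` and `is_nonterminal` are hypothetical, and the `<...>` convention for non-terminals is carried over from the listing.

```python
import heapq
from itertools import islice

def is_nonterminal(sym):
    return sym.startswith('<')

def generate(grammar, start):
    """Yield distinct strings of the grammar in non-decreasing length order."""
    queue = [(1, (start,))]   # entries are (length, sentential form)
    seen = set()              # strings that were already emitted
    while queue:
        _, form = heapq.heappop(queue)
        # Expand only the leftmost non-terminal (leftmost derivations).
        idx = next(i for i, sym in enumerate(form) if is_nonterminal(sym))
        for rhs in grammar[form[idx]]:
            new = form[:idx] + tuple(rhs) + form[idx + 1:]
            if any(is_nonterminal(sym) for sym in new):
                heapq.heappush(queue, (len(new), new))
            else:
                text = ''.join(new)
                if text not in seen:
                    seen.add(text)
                    yield text

if __name__ == "__main__":
    # A deliberately ambiguous toy grammar, so duplicates would occur without `seen`.
    toy = {'<s>': [['a'], ['<s>', '<s>']]}
    print(list(islice(generate(toy, '<s>'), 4)))
```

Because it is a generator, the caller decides how many strings to take, for example with `itertools.islice`. The set grows with the number of emitted strings, which is the price for suppressing duplicates.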

Related

Dynamic Programming of Markov Decision Process with Value Iteration

I am learning about MDP's and value iteration in self-study and I hope someone can improve my understanding.
Consider the problem of a 3-sided die with the numbers 1, 2, 3. If you roll a 1 or a 2 you get that value in $, but if you roll a 3 you lose all your money and the game ends (finite horizon problem).
Conceptually I understand how this is done with the following formula:
So let's break that down:
Since this is a finite horizon problem we can ignore gamma.
If I observe 1, I can either go or stop. The utility/value of that is:
V(1) = max(Q(1, g), Q(1, s))
Q(1, g) = r + SUM( P( 2 | 1,g) * V(2) + P( 3 | 1,g) * V(3))
Q(1, s) = r + SUM( P( 2 | 1,s) * V(2) + P( 3 | 1,s) * V(3))
where r = 1
If I observe 2, I can either go or stop:
V(2) = max(Q(2, g), Q(2, s))
Q(2, g) = r + SUM( P( 1 | 2,g) * V(1) + P( 3 | 2,g) * V(3))
Q(2, s) = r + SUM( P( 1 | 2,s) * V(1) + P( 3 | 2,s) * V(3))
where r = 2
I observe 3, the game ends.
Intuitively V(3) is 0 because the game is over, so we can remove that half from the equation of Q(1, g). We defined V(2) above also so we can substitute that as:
Q(1, g) = r + SUM( P( 2 | 1,g) *
MAX ((P( 1 | 2,g) * V(1)) , (P( 1 | 2,s) * V(1))))
This is where things take a bad turn. I am not sure how to solve Q(1, g) if it has its own definition in its solution. This is likely due to a poor math background.
What I do understand is that the utilities or the values of the states will change based on the reward and therefore the decision will change.
Specifically if rolling three gave you $3 while rolling one ended the game, that will affect your decision because the utility has changed.
But I am not sure how to write code to calculate that.
Can someone explain how Dynamic Programming works in this? How do I solve Q(1,g) or Q(1,s) when it is in its own definition?
Special solution:
For your example, it is pretty easy to know whether "go" or "stop" should be chosen: there is a money value X at which "go" and "stop" are equally good; for all smaller values you should "go", and for all bigger values you should "stop". So the only question is what this value is:
X=E("stop"|X)=E("go"|X)=1/3(1+X)+1/3(2+X) =>
1/3X=1 =>
X=3
Already in the first line, I used the fact that even if I choose "go" and win, I will choose "stop" in the next round. Knowing which decision should be made, it is easy to calculate the expected win under the perfect strategy, here in Python:
def calc(money):
    PROB=1.0/3.0
    if money<3:#go
        return PROB*calc(money+1)+PROB*calc(money+2)-PROB*0
    else:#stop
        return money

print "Expected win:", calc(0)
>>> Expected win: 1.37037037037
General solution:
I'm not sure the above course of action can be generalized for arbitrary scenarios. However, there is another possibility to solve such problems.
Let's change the game a little bit: instead of infinitely many turns, at most N turns are possible. Then your recursion becomes:
E(money, N)=max(money, 1/3*E(money+1, N-1)+1/3*E(money+2, N-1))
As you can easily see the value E(money, N) no longer depends on itself but on results of a game with smaller number of turns.
Without proof, I state that the value you are looking for is E(money)=lim_{N->infinity} E(money, N).
For your special problem the Python code would look as follows:
PROB=1.0/3.0
MAX_GOS=20#neglect all possibilities with more than MAX_GOS decisions "GO"
LENGTH=2*MAX_GOS+1#per go 2$ are possible
#What is expected value if the game ended now?
expected=range(LENGTH)
for gos_left in range(1,MAX_GOS+1):
    next=[0]*len(expected)
    for money in range(LENGTH-gos_left*2):
        next[money]=max(expected[money], PROB*expected[money+1]+PROB*expected[money+2])#decision stop or go
    expected=next
print "Expected win:", expected[0]
>>> Expected win: 1.37037037037
I'm glad both methods yielded the same result!
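As a cross-check, the finite-horizon recursion E(money, N) from the general solution can also be written top-down with memoisation. The sketch below is one possible Python 3 rendering, not the answer's code: the function name `expected` and the use of exact fractions are choices made for this example, and the second "go" branch is taken to be money+2, as in the bottom-up loop.

```python
from fractions import Fraction
from functools import lru_cache

PROB = Fraction(1, 3)  # probability of each die face

@lru_cache(maxsize=None)
def expected(money, turns_left):
    """Expected final payout with optimal play and at most turns_left more rolls."""
    if turns_left == 0:
        return Fraction(money)  # no rolls left: forced to stop
    # "go": win 1 or 2 with probability 1/3 each; rolling a 3 pays nothing.
    go = (PROB * expected(money + 1, turns_left - 1)
          + PROB * expected(money + 2, turns_left - 1))
    return max(Fraction(money), go)  # "stop" keeps the current money

if __name__ == "__main__":
    value = expected(0, 20)
    print(value, float(value))
```

With exact arithmetic the value for an empty purse comes out as 37/27, which agrees with the decimal 1.37037... printed by both programs above.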

Algorithm to print all valid combinations of n pairs of parentheses

I'm working on the problem stated in the question statement. I know my solution is correct (ran the program) but I'm curious as to whether or not I'm analyzing my code (below) correctly.
def parens(num)
  return ["()"] if num == 1
  paren_arr = []
  parens(num-1).each do |paren|
    paren_arr << paren + "()" unless "()#{paren}" == "#{paren}()"
    paren_arr << "()#{paren}"
    paren_arr << "(#{paren})"
  end
  paren_arr
end
parens(3), as an example, will output the following:
["()()()", "(()())", "(())()", "()(())", "((()))"]
Here's my analysis:
Every f(n) value is roughly 3 times as many elements as f(n-1). So:
f(n) = 3 * f(n-1) = 3 * 3 * f(n-2) ~ (3^n) time cost.
By a similar analysis, the stack will be occupied by f(1)..f(n) and so the space complexity should be 3^n.
I'm not sure if this analysis for either time or space is correct. On the one hand, the stack only holds n function calls, but each of these calls returns an array ~3 times as big as the call before it. Does this factor into space cost? And is my time analysis correct?
As others have mentioned, your solution is not correct.
My favourite solution to this problem generates all the valid combinations by repeatedly incrementing the current string to the lexically next valid combination.
"Lexically next" breaks down into a few rules that make it pretty easy:
The first difference in the string changes a '(' to a ')'. Otherwise the next string would be lexically before the current one.
The first difference is as far to the right as possible. Otherwise there would be smaller increments.
The part after the first difference is lexically minimal, again because otherwise there would be smaller increments. In this case that means that all the '('s come before all the ')'.
So all you have to do is find the rightmost '(' that can be changed to a ')', flip it, and then append the correct number of '('s and ')'s.
I don't know Ruby, but in Python it looks like this:
current="(((())))"
while True:
    print(current)
    opens=0
    closes=0
    pos=0
    for i in range(len(current)-1,-1,-1):
        if current[i]==')':
            closes+=1
        else:
            opens+=1
            if closes > opens:
                pos=i
                break
    if pos<1:
        break
    current = current[:pos]+ ")" + "("*opens + ")"*(closes-1)
Output:
(((())))
((()()))
((())())
((()))()
(()(()))
(()()())
(()())()
(())(())
(())()()
()((()))
()(()())
()(())()
()()(())
()()()()
Solutions like this turn out to be easy and fast for many types of "generate all the combinations" problems.
Recursive reasoning makes a simple solution. If the number of left parens remaining to emit is positive, emit one and recur. If the number of right parens remaining to emit is greater than the number of left, emit and recur. The base case is when all parens, both left and right, have been emitted. Print.
def parens(l, r = l, s = "")
  if l > 0 then parens(l - 1, r, s + "(") end
  if r > l then parens(l, r - 1, s + ")") end
  if l + r == 0 then print "#{s}\n" end
end
As others have said, the Catalan numbers give the number of strings that will be printed.
While this Ruby implementation doesn't achieve it, a lower-level language (like C) would make it easy to use a single string buffer: O(n) space. Due to substring copies, this one is O(n^2). But since the run time and output length grow exponentially with n (the Catalan numbers grow roughly like 4^n / n^(3/2)), a factor-of-n space inefficiency doesn't mean much.
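To check the count against the Catalan numbers, here is a Python sketch of the same recursion written as a generator. It is a port for illustration only: the name `catalan_number` is made up for this example, and `math.comb` needs Python 3.8 or later.

```python
from math import comb

def parens(l, r=None, s=""):
    """Yield every balanced string with l '(' and r ')' still to emit."""
    if r is None:
        r = l
    if l > 0:                    # a left paren can always be emitted
        yield from parens(l - 1, r, s + "(")
    if r > l:                    # a right paren needs an unmatched left one
        yield from parens(l, r - 1, s + ")")
    if l + r == 0:               # base case: everything has been emitted
        yield s

def catalan_number(n):
    # Closed form C_n = binomial(2n, n) / (n + 1), always an integer.
    return comb(2 * n, n) // (n + 1)

if __name__ == "__main__":
    for n in range(1, 6):
        print(n, len(list(parens(n))), catalan_number(n))
```

For n = 4 both columns show 14, matching the fourteen strings listed in the earlier answer.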
I found Tom Davis' article, "Catalan Numbers," very helpful in explaining one recursive method for defining the Catalan Numbers. I'll try to explain it myself (in part, to see how much of it I've understood) as it may be applied to finding the set of all unique arrangements of N matched parentheses (e.g., 1 (); 2 ()(), (()); etc. ).
For N > 1 let (A)B represent one arrangement of N matched parentheses, where A and B each have only balanced sets of parentheses. Then we know that if A contains k matched sets, B must have the other N - k - 1, where 0 <= k <= N - 1.
In the following example, a dot means the group has zero sets of parentheses:
C_0 => .
C_1 => (.)
To enumerate C_2, we arrange C_1 as AB in all ways and place the second parentheses around A:
. () = AB = C_0C_1 => (.)()
() . = AB = C_1C_0 => (()) .
Now for C_3, we have three partitions for N - 1, each with its own combinations: C_0C_2, C_1C_1, C_2C_0
C_0C_2 = AB = . ()() and . (()) => ()()(), ()(())
C_1C_1 = AB = ()() => (())()
C_2C_0 = AB = ()() . and (()) . => (()()), ((()))
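The (A)B decomposition can also be sketched directly on strings. The Python below is only an illustration of the idea, with a hypothetical function name `arrangements`; it caches each C_k so that smaller sets are computed once.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def arrangements(n):
    """All arrangements of n matched pairs, built as (A)B with |A| = k, |B| = n-k-1."""
    if n == 0:
        return ("",)             # C_0: the single empty arrangement
    out = []
    for k in range(n):           # k pairs go inside the first pair of parentheses
        for a in arrangements(k):
            for b in arrangements(n - k - 1):
                out.append("(" + a + ")" + b)
    return tuple(out)

if __name__ == "__main__":
    print(arrangements(3))
```

For n = 3 this yields the five strings in the same partition order as the hand enumeration above: C_0C_2 first, then C_1C_1, then C_2C_0.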
We can code this method by keeping a set for each N and iterating over the combinations for each partition. We'll keep the individual arrangements as bits: 0 for left and 1 for right (this appears backwards when cast as a binary string).
def catalan
  Enumerator.new do |y|
    # the zero here represents none rather than left
    s = [[0],[2]]
    y << [0]
    y << [2]
    i = 2
    while true
      s[i] = []
      (0..i - 1).each do |k|
        as = s[k]
        bs = s[i - k - 1]
        as.each do |a|
          bs.each do |b|
            if a != 0
              s[i] << ((b << (2*k + 2)) | (1 << (2*k + 1)) | (a << 1))
            else
              s[i] << (2 | (b << 2))
            end
          end # bs
        end # as
      end # k
      y.yield(s[i])
      i = i + 1
    end # i
  end # enumerator
end
catalan.take(4)
# => [[0], [2], [10, 12], [42, 50, 44, 52, 56]]
The yielder is lazy: although the list is infinite, we can generate as little as we like (using .take for example):
catalan.take(4).last.map{|x| x.to_s(2)}
# => ["101010", "110010", "101100", "110100", "111000"]
The former generation obliges us to keep all previous sets in order to issue the next. Alternatively, we can build a requested set through a more organic, meandering type of recursion. This next version yields each arrangement to the block, so we can type:
catalan(4){
  |x| (0..7).reduce(""){
    |y,i| if x[i] == 0 then y + "(" else y + ")" end
  }
}.take(14)
# => ["(((())))", "((()()))", "((())())", "((()))()", "(()(()))", "(()()())",
# "(()())()", "(())(())", "(())()()", "()((()))", "()(()())", "()(())()",
# "()()(())", "()()()()"]
Direct generation:
def catalan(n)
  Enumerator.new do |y|
    s = [[0,0,0]]
    until s.empty?
      left,right,result = s.pop
      if left + right == 2 * n
        y << yield(result)
      end
      if right < left
        s << [left, right + 1, result | 1 << (left + right)]
      end
      if left < n
        s << [left + 1, right, result]
      end
    end
  end
end

Use of IF statement with matrices in fortran

I want to go through a matrix and check if any block of it is the same as a predefined unit. Here is my code. 'sd5' is the 2 by 2 predefined unit.
ALLOCATE (fList((n-1)**2,3))
fList = 0
p = 1
DO i = 1, n-1, 1
  DO j = 1, n-1, 1
    IF (TEST(i:i+1, j:j+1) == sd5) THEN
      fList(p,1:3) = (i, j+1, 101) ! 101 should be replaced by submatrix number
    END IF
    p = p+1
  END DO
END DO
The problem seems to be in the IF statement as four logical statements are returned in TEST(i:i+1, j:j+1) == sd5. I get this error:
Error: IF clause at (1) requires a scalar LOGICAL expression
I get another error:
fList(p,1:3) = (i, j+1, 101) ! 101 should be replaced by sub matrix number
1
Error: Expected PARAMETER symbol in complex constant at (1)
I do not understand this error, as all variables are integer which I defined.
First, if statements require scalar clauses.
(TEST(i:i+1, j:j+1) == sd5)
results in a 2x2 matrix containing .true. or .false.. Since you want to check all entries, the statement should read
IF ( all( TEST(i:i+1, j:j+1) == sd5) ) THEN
[ You could also use any if only one matching entry is sufficient. ]
The second statement is a little tricky, since you do not state what you want to achieve. As it is, it is not what you would expect. My guess is that you are trying to store a vector of length three, and the assignment should read
fList(p,1:3) = (/ i, j+1, 101 /)
or
fList(p,1:3) = [ i, j+1, 101 ]
The syntax you provided is in fact used to specify complex constants:
( Real, Imag )
In this form, Real and Imag need to be constants or literals themselves, cf. the Fortran 2008 Standard, R417.

Understanding the algorithm for pattern matching using an LCP array

Foreword: My question is mainly an algorithmic question, so even if you are not familiar with suffix and LCP arrays you can probably help me.
In this paper it is described how to efficiently use suffix and LCP arrays for string pattern matching.
I understand how SA and LCP work and how the algorithm's runtime can be improved from O(P*log(N)) (where P is the length of the pattern and N is the length of the string) to O(P+log(N)) (thanks to Chris Eelmaa's answer here and jogojapan's answer here).
I was trying to go through the algorithm in figure 4 which explains the usage of LLcp and RLcp. But I have problems understanding how it works.
The algorithm (taken from the source):
Explanation of the used variable names:
lcp(v,w) : Length of the longest common prefix of v and w
W = w0..wP-1 : pattern of length P
A = a0..aN-1 : the text (length N)
Pos[0..N-1] : suffix array
L_W : index (in A) of first occurrence of the matched pattern
M : middle index of current substring
L : lower bound
R : upper bound
Llcp : array of size N-2 such that Llcp[M] = lcp(A_Pos[L_M], A_Pos[M]) where L_M is the lower bound of the unique interval with M in the middle
Rlcp : array of size N-2 such that Rlcp[M] = lcp(A_Pos[R_M], A_Pos[M]) where R_M is the upper bound of the unique interval with M in the middle
Now I want to try the algorithm using the following example (partly taken from here):
SA | LCP | Suffix entry
-----------------------
5 | N/A | a
3 | 1 | ana
1 | 3 | anana
0 | 0 | banana
4 | 0 | na
2 | 2 | nana
A = "banana" ; N = 6
W = "ban" ; P = 3
I want to try to match a string, say ban and would expect the algorithm to return 0 as L_W.
Here is how I would step through the algorithm:
l = lcp("a", "ban") = 0
r = lcp("nana", "ban") = 0
if 0 = 3 or 'b' =< 'a' then // which is NOT the case for both conditions
L_W = 0
else if 0 < 3 or 'b' =< 'n' then // which is the case for both conditions
L_W = 6 // which means 'not found'
...
...
I feel like I am missing something but I can't find out what. Also I am wondering how the precomputed LCP array can be used instead of calling lcp(v,w).
I believe there was an error.
The first condition is fairly easy to understand. When the LCP length equals the pattern length, we're done. When the pattern is smaller than or equal to the smallest suffix, the only choice is the smallest one.
The second condition is wrong. We can prove it by contradiction. r < P || Wr <= a... means r >= P && Wr > a... If r >= P, then how can we have Lw = N(not found), since we already have r length common prefix?
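To make the banana example concrete, here is a Python sketch that builds the suffix array and the adjacent LCP values from the table, then finds "ban" with a plain binary search. Note that this is the simple O(P*log(N)) search, not the Llcp/Rlcp-accelerated algorithm from the paper, and the helper names (`suffix_array`, `lcp`, `find_first`) are made up for this example.

```python
def suffix_array(text):
    # Naive construction: sort suffix start positions by the suffix text.
    return sorted(range(len(text)), key=lambda i: text[i:])

def lcp(v, w):
    """Length of the longest common prefix of v and w."""
    n = 0
    while n < len(v) and n < len(w) and v[n] == w[n]:
        n += 1
    return n

def find_first(text, pos, pattern):
    """Rank in pos of the first suffix starting with pattern, or len(text) if none."""
    lo, hi = 0, len(pos)
    while lo < hi:
        mid = (lo + hi) // 2
        # Compare only the first len(pattern) characters of the suffix.
        if text[pos[mid]:pos[mid] + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    if lo < len(pos) and text.startswith(pattern, pos[lo]):
        return lo
    return len(text)  # "not found", mirroring L_W = N in the question

if __name__ == "__main__":
    text = "banana"
    pos = suffix_array(text)
    adjacent = [lcp(text[pos[i - 1]:], text[pos[i]:]) for i in range(1, len(pos))]
    rank = find_first(text, pos, "ban")
    print(pos, adjacent, rank, pos[rank])
```

Here the search returns a rank in the suffix array (3 for "ban"), and `pos[rank]` gives the text position 0 that the question expects.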

Simple recursion problem

I'm taking my first steps into recursion and dynamic programming and have a question about forming subproblems to model the recursion.
Problem:
How many different ways are there to
flip a fair coin 5 times and not have
three or more heads in a row?
If someone could put up some heavily commented code (Ruby preferred but not essential) to help me get there, that would be great. I am not a student if that matters, this is a modification of a Project Euler problem to make it very simple for me to grasp. I just need to get the hang of writing recursion formulas.
If you would like to abstract the problem into how many different ways are there to flip a fair coin Y times and not have Z or more heads in a row, that may be beneficial as well. Thanks again, this website rocks.
You can simply create a formula for that:
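The number of ways to flip a coin 5 times without having 3 heads in a row is equal to the number of combinations of 5 coin flips minus the combinations with at least three heads in a row. In this case, grouping by where the first run of three heads starts:
HHH-- (4 combinations)
THHH- (2 combinations)
-THHH (2 combinations: TTHHH and HTHHH)
The total number of combinations = 2^5 = 32. And 32 - 8 = 24.
If we flip a coin N times without Q heads in a row, the total amount is 2^N, but the amount with at least Q heads in a row has no equally simple closed form for larger N, because a run can start after many different prefixes. A recurrence does the job instead: every valid sequence starts with k heads (0 <= k < Q) followed by a tail and a shorter valid sequence. So the general answer is:
Flip(N,Q) = Flip(N-1,Q) + Flip(N-2,Q) + ... + Flip(N-Q,Q) for N >= Q, with Flip(N,Q) = 2^N for N < Q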
Of course you can use recursion to simulate the total amount:
flipme: N x N -> N
flipme(flipsleft, maxhead) = flip(flipsleft, maxhead, 0)
flip: N x N x N -> N
flip(flipsleft, maxhead, headcount) ==
  if maxhead <= headcount then 0 // too many heads in a row
  else if flipsleft <= 0 then 1 // one complete valid sequence
  else
    flip(flipsleft - 1, maxhead, headcount+1) + // head
    flip(flipsleft - 1, maxhead, 0) // tail resets the run of heads
Here's my solution in Ruby
def combination(length=5)
  return [[]] if length == 0
  combination(length-1).collect {|c| [:h] + c if c[0..1]!= [:h,:h]}.compact +
  combination(length-1).collect {|c| [:t] + c }
end

puts "There are #{combination.length} ways"
All recursive methods start with an early out for the end case.
return [[]] if length == 0
This returns an array of combinations, where the only combination of zero length is []
The next bit (simplified) is...
combination(length-1).collect {|c| [:h] + c } +
combination(length-1).collect {|c| [:t] + c }
So.. this says.. I want all combinations that are one shorter than the desired length with a :head added to each of them... plus all the combinations that are one shorter with a tail added to them.
The way to think about recursion is.. "assuming I had a method to do the n-1 case.. what would I have to add to make it cover the n case". To me it feels like proof by induction.
This code would generate all combinations of heads and tails up to the given length.
We don't want ones that have :h :h :h. That can only happen where we have :h :h and we are adding a :h. So... I put an if c[0..1] != [:h,:h] on the adding of the :h so it will return nil instead of an array when it was about to make an invalid combination.
I then had to compact the result to ignore all results that are just nil
Isn't this a matter of taking all possible 5 bit sequences and removing the cases where there are three sequential 1 bits (assuming 1 = heads, 0 = tails)?
Here's one way to do it in Python:
#This will hold all possible combinations of flipping the coins.
flips = [[]]
for i in range(5):
    #Loop through the existing permutations, and add either 'h' or 't'
    #to the end.
    for j in range(len(flips)):
        f = flips[j]
        tails = list(f)
        tails.append('t')
        flips.append(tails)
        f.append('h')

#Now count how many of the permutations match our criteria.
fewEnoughHeadsCount = 0
for flip in flips:
    hCount = 0
    hasTooManyHeads = False
    for c in flip:
        if c == 'h': hCount += 1
        else: hCount = 0
        if hCount >= 3: hasTooManyHeads = True
    if not hasTooManyHeads: fewEnoughHeadsCount += 1

print('There are %s ways.' % fewEnoughHeadsCount)
This breaks down to:
How many ways are there to flip a fair coin four times when the first flip was heads + when the first flip was tails:
So in python:
HEADS = "1"
TAILS = "0"

def threeOrMoreHeadsInARow(bits):
    return "111" in bits

def flip(n = 5, flips = ""):
    if threeOrMoreHeadsInARow(flips):
        return 0
    if n == 0:
        return 1
    return flip(n - 1, flips + HEADS) + flip(n - 1, flips + TAILS)
Here's a recursive combination function using Ruby yield statements:
def combinations(values, n)
  if n.zero?
    yield []
  else
    combinations(values, n - 1) do |combo_tail|
      values.each do |value|
        yield [value] + combo_tail
      end
    end
  end
end
And you could use regular expressions to parse out three heads in a row:
def three_heads_in_a_row(s)
  ([/hhh../, /.hhh./, /..hhh/].collect {|pat| pat.match(s)}).any?
end
Finally, you would get the answer using something like this:
total_count = 0
filter_count = 0
combinations(["h", "t"], 5) do |combo|
  total_count += 1
  unless three_heads_in_a_row(combo.join)
    filter_count += 1
  end
end
puts "TOTAL: #{ total_count }"
puts "FILTERED: #{ filter_count }"
So that's how I would do it :)
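For the generalised question (Y flips, no Z or more heads in a row), a small Python sketch can cross-check the counts by comparing brute force with a dynamic-programming recurrence. The function names `count_brute` and `count_dp` are hypothetical; the DP tracks how many valid sequences end in a run of exactly k heads.

```python
from itertools import product

def count_brute(n, q):
    """Count length-n head/tail sequences with no run of q heads, by enumeration."""
    return sum(1 for flips in product("ht", repeat=n)
               if "h" * q not in "".join(flips))

def count_dp(n, q):
    """Same count via DP: ways[k] = valid sequences ending in exactly k heads."""
    ways = [1] + [0] * (q - 1)        # the empty sequence ends in zero heads
    for _ in range(n):
        total = sum(ways)             # appending a tail resets the run to 0
        ways = [total] + ways[:-1]    # appending a head extends each run by 1
    return sum(ways)

if __name__ == "__main__":
    print(count_brute(5, 3), count_dp(5, 3))
```

The enumeration is exponential in the number of flips, so it only serves as a check for small inputs; the DP needs about n*q additions.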

Resources