I wrote an interpreter for a scripting language that has a for loop. I have defined the grammar with JavaCC, but I can't find a way to go back to an earlier line to repeat the execution of the "for" block.
How can I back up the token manager so that loop bodies can be re-parsed, and thus re-evaluated, over and over again?
void For(): {ArrayList<String> lst;Token n,v;int i=0;} {
"for" "(" n=<ID> ":" v=<ID> ")" "{"
(actions()";" )+
"}"
}
As explained in the JavaCC FAQ (http://www.engr.mun.ca/~theo/JavaCC-FAQ/javacc-faq-moz.htm#tth_sEc7.3), the best approach is to output some form of intermediate representation and then interpret that. Some common approaches follow.
One way is to output a tree from the parser. Then interpret the tree. The interpreter could use the Interpreter design pattern. https://en.wikipedia.org/wiki/Interpreter_pattern
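To make the tree idea concrete, here is a minimal sketch in Ruby (chosen only for brevity; the same shape works in Java). The node classes Emit and ForEach and the toy "emit" statement are hypothetical and not part of JavaCC. The point is only that the loop body is parsed once and then executed as often as needed.

```ruby
# Hypothetical AST node classes; the names are illustrative, not part of JavaCC.
class Emit
  def initialize(name)
    @name = name
  end

  # Append the current value of the variable to the output list.
  def execute(env, out)
    out << env.fetch(@name)
  end
end

class ForEach
  def initialize(var, list_name, body)
    @var = var
    @list_name = list_name
    @body = body
  end

  # The body was parsed once; here it is simply executed once per element.
  def execute(env, out)
    env.fetch(@list_name).each do |item|
      env[@var] = item
      @body.each { |stmt| stmt.execute(env, out) }
    end
  end
end

# Tree that a parser might build for: for (x : items) { emit x; }
tree = ForEach.new("x", "items", [Emit.new("x")])
env = { "items" => %w[a b c] }
out = []
tree.execute(env, out)
```

In a JavaCC grammar, the semantic actions of For() would construct the equivalent of ForEach instead of executing anything directly.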
A second way is to translate to the machine code for a virtual machine. An example of this approach can be seen in the Turtle Talk interpreter at http://www.engr.mun.ca/~theo/Courses/sd/5895-downloads/turtle-talk.zip .
A third way is to translate to another high-level programming language which can then be compiled and executed.
As other answers have said, this is why usually people just build a data structure in memory, where the group of commands is parsed once and can then just be executed repeatedly.
But if you want to do a pure interpreter for now, what is needed is to remember the start position of the token that turned out to be the loop (i.e. the "for" token) and rewind to that position in the token stream. You might also have to scan ahead for the opening and closing brackets ("{" and "}") so that you know the end of the loop.
Once you have that, your "for" command changes from an actual loop into a variant of the "if" statement. If the condition is true, you execute the commands, if the condition is false, you jump over everything, including the closing "}". And when you hit the "}", you jump back to the position of the "for" to check the condition all over again.
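The rewind approach described above can be sketched roughly as follows, in Ruby for brevity. The token layout (for ( var : list ) { ... } and emit name ;) is a made-up toy language, and run and matching_brace are hypothetical helper names. A real JavaCC interpreter would rewind the token manager rather than index into an array.

```ruby
# Pure-interpreter sketch: no tree is built, the token position is rewound.
# Find the index of the "}" that closes the "{" at open_pos.
def matching_brace(tokens, open_pos)
  depth = 0
  (open_pos...tokens.length).each do |i|
    depth += 1 if tokens[i] == "{"
    depth -= 1 if tokens[i] == "}"
    return i if depth.zero?
  end
  raise "unbalanced braces"
end

def run(tokens, env)
  out = []
  loops = [] # stack of active loops: position of "for" plus remaining items
  pos = 0
  while pos < tokens.length
    case tokens[pos]
    when "for"
      # Layout: for ( var : list ) {   ("{" sits at pos + 6)
      var = tokens[pos + 2]
      list = tokens[pos + 4]
      frame = loops.last
      unless frame && frame[:start] == pos
        # First visit of this loop: remember where it starts.
        frame = { start: pos, items: env.fetch(list).dup }
        loops.push(frame)
      end
      if frame[:items].empty?
        # Condition false: jump over everything, including the closing "}".
        loops.pop
        pos = matching_brace(tokens, pos + 6) + 1
      else
        env[var] = frame[:items].shift
        pos += 7 # first token of the body
      end
    when "}"
      pos = loops.last[:start] # jump back and check the condition again
    when "emit"
      out << env.fetch(tokens[pos + 1])
      pos += 3 # emit, name, ;
    else
      raise "unexpected token #{tokens[pos]}"
    end
  end
  out
end

tokens = %w[for ( x : xs ) { for ( y : ys ) { emit y ; } emit x ; }]
result = run(tokens, { "xs" => [1, 2], "ys" => ["a"] })
```

Note how "for" behaves like an "if" here: the actual repetition comes from the jump performed at "}".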
We can use a JAVACODE production like the one below,
provided the actions() production takes care of each line.
Please move the semicolon into the actions() production.
TOKEN : {
<LCURLY : "{" >
| <RCURLY : "}" >
}
void For(): {ArrayList<String> lst;Token n,v;int i=0;} {
"for" "(" n=<ID> ":" v=<ID> ")" <LCURLY>
loopBody()
<RCURLY>
}
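JAVACODE
void loopBody() {
// Peek at the next token without consuming it, so that actions()
// still sees its own first token and For() can still match <RCURLY>.
while (getToken(1).kind != RCURLY && getToken(1).kind != EOF) {
actions();
}
}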
Hope This will help.
I need to check if any elements of a large (60,000+ elements) array are present in a long string of text. My current code looks like this:
if $TARGET_PARTLIST.any? { |target_pn| pdf_content_string.include? target_pn }
self.last_match_code = target_pn
self.is_a_match = true
end
I get a syntax error undefined local variable or method target_pn.
Could someone let me know the correct syntax to use for this block of code? Also, if anyone knows of a quicker way to do this, I'm all ears!
In this case, all your syntax is correct, you've just got a logic error. While target_pn is defined (as a parameter) inside the block passed to any?, it is not defined in the block of the if statement because the scope of the any?-block ends with the closing curly brace, and target_pn is not available outside its scope. A correct (and more idiomatic) version of your code would look like this:
self.is_a_match = $TARGET_PARTLIST.any? do |target_pn|
included = pdf_content_string.include? target_pn
self.last_match_code = target_pn if included
included
end
Alternately, as jvillian so kindly suggests, one could turn the string into an array of words, then do an intersection and see if the resulting set is nonempty. Like this:
self.is_a_match = !($TARGET_PARTLIST &
pdf_content_string.gsub(/[^A-Za-z ]/,"")
.split).empty?
Unfortunately, this approach loses self.last_match_code. As a note, pointed out by Sergio, if you're dealing with non-English languages, the above regex will have to be changed.
Hope that helps!
You should use Enumerable#find rather than Enumerable#any?.
found = $TARGET_PARTLIST.find { |target_pn| pdf_content_string.include? target_pn }
if found
self.last_match_code = found
self.is_a_match = true
end
Note this does not ensure that the string contains a word that is an element of $TARGET_PARTLIST. For example, if $TARGET_PARTLIST contains the word "able", that string will be found in the string, "Are you comfortable?". If you only want to match words, you could do the following.
found = $TARGET_PARTLIST.find { |target_pn| pdf_content_string[/\b#{Regexp.escape(target_pn)}\b/] }
Note this uses the method String#[].
\b is a word break in the regular expression, meaning that the first (last) character of the match cannot be preceded (followed) by a word character (a letter, digit or underscore).
If speed is important it may be faster to use the following.
found = $TARGET_PARTLIST.find { |target_pn|
pdf_content_string.include?(target_pn) && pdf_content_string[/\b#{Regexp.escape(target_pn)}\b/] }
A probably more performant way would be to move all this into native code by letting Regexp search for it.
# needed only once
TARGET_PARTLIST_RE = Regexp.new("\\b(?:#{$TARGET_PARTLIST.sort.map { |pl| Regexp.escape(pl) }.join('|')})\\b")
# to check
self.last_match_code = pdf_content_string[TARGET_PARTLIST_RE]
self.is_a_match = !self.last_match_code.nil?
A much more performant way would be to build a prefix tree and create the regexp using the prefix tree (this optimises the regexp lookup), but this is a bit more work :)
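A rough sketch of the prefix-tree idea, under the assumption that a nested Hash is good enough as a trie. The helper names trie_from and trie_to_source are made up for this example, and whether it actually beats the plain alternation should be measured on the real part list.

```ruby
# Build a nested-hash trie: each key is one character, :end marks a word end.
def trie_from(words)
  trie = {}
  words.each do |word|
    node = trie
    word.each_char { |ch| node = (node[ch] ||= {}) }
    node[:end] = true
  end
  trie
end

# Turn the trie into regexp source, sharing common prefixes between words.
def trie_to_source(node)
  branches = node.reject { |key, _| key == :end }.map do |ch, child|
    Regexp.escape(ch) + trie_to_source(child)
  end
  return "" if branches.empty?

  source = branches.length == 1 ? branches.first : "(?:#{branches.join('|')})"
  # If a word may stop at this node, everything after it is optional.
  node[:end] ? "(?:#{source})?" : source
end

words = ["able", "abc", "ab-1"] # stand-in for $TARGET_PARTLIST
trie_re = Regexp.new("\\b#{trie_to_source(trie_from(words))}\\b")

hit = "part ab-1 here"[trie_re]
miss = "Are you comfortable?"[trie_re]
```

Because the three sample words share the prefix "ab", the generated expression tests that prefix once instead of once per alternative.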
So the title might be a little bit misleading, but I can't think of any better way to phrase it.
Basically, I'm writing a lexical scanner using cygwin/lex. A part of the code reads a token /*. It then goes into a predefined state C_COMMENT, and leaves that state when it reads */. Below is the actual code:
"/*" {BEGIN(C_COMMENT); printf("%d: /*", linenum++);}
<C_COMMENT>"*/" { BEGIN(INITIAL); printf("*/\n"); }
<C_COMMENT>. {printf("%s",yytext);}
The code works when the comment is in a single line, such as
/* * Example of comment */
It will print the current line number, with the comment behind. But it doesn't work if the comment spans multiple lines. Rewriting the 3rd line into
<C_COMMENT>. {printf("%s",yytext);
printf("\n");}
doesn't work. It will result in \n printed for every letter in the comment. I'm guessing it has something to do with C having no strings or maybe I'm using the states wrong.
Hope someone will be able to help me out :)
Also if there's any other info you need, just ask, and I'll provide.
The easiest way to echo the token scanned by a pattern is to use the special action ECHO:
"/*" { printf("%d: ", linenum++); ECHO; BEGIN(C_COMMENT); }
<C_COMMENT>"*/" { ECHO; BEGIN(INITIAL); }
<C_COMMENT>. { ECHO; }
None of the above rules matches a newline inside a comment, because in (f)lex . doesn't match newlines, so a separate rule is needed for that:
<C_COMMENT>\n { linenum++; ECHO; }
A faster way of recognizing C comments is with a single regular expression, although it's a little hard to read:
[/][*][^*]*[*]+([^/*][^*]*[*]+)*[/]
In this case, you'll have to rescan the comment to count newlines, unless you get flex to do the line number counting.
flex scanners maintain a line number count in yylineno, if you request that feature (using %option yylineno). It's often more efficient and always more reliable than keeping the count yourself. However, in the action, the value of yylineno is the line number count at the end of the pattern, not at the beginning, which can be misleading for multiline patterns. A common workaround is to save the value of yylineno in another variable at the beginning of the token scan.
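The single-expression approach, together with the "remember the line number at the start of the token" workaround, can be tried outside of flex. The following is only an illustration in Ruby, not flex code. It assumes the comment pattern has the inner repetition written as [^*]*, and the method name comments_with_lines is made up.

```ruby
# One regular expression for a complete C comment.
COMMENT = %r{/\*[^*]*\*+(?:[^/*][^*]*\*+)*/}

# Return [starting line, comment text] pairs for every comment in source.
def comments_with_lines(source)
  results = []
  line = 1
  pos = 0
  while (m = COMMENT.match(source, pos))
    # Count the newlines skipped before the comment begins.
    line += source[pos...m.begin(0)].count("\n")
    results << [line, m[0]] # line number at the START of the token
    # Rescan the comment itself so that later tokens get the right line.
    line += m[0].count("\n")
    pos = m.end(0)
  end
  results
end

sample = "a\n/* one */ b\n/* two\n * lines */\nc /* x **/"
found = comments_with_lines(sample)
```

The two separate counts mirror the point about yylineno: the count after the match is the line where the comment ends, so the start line has to be saved before the comment's own newlines are added.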
I am having a string as below:
str1='"{\"#Network\":{\"command\":\"Connect\",\"data\":
{\"Id\":\"xx:xx:xx:xx:xx:xx\",\"Name\":\"somename\",\"Pwd\":\"123456789\"}}}\0"'
I wanted to extract the somename string from the above string. Values of xx:xx:xx:xx:xx:xx, somename and 123456789 can change but the syntax will remain same as above.
I saw similar posts on this site but don't know how to use regex in the above case.
Any ideas how to extract the above string?
Parse the string to JSON and get the values that way.
require 'json'
str = "{\"#Network\":{\"command\":\"Connect\",\"data\":{\"Id\":\"xx:xx:xx:xx:xx:xx\",\"Name\":\"somename\",\"Pwd\":\"123456789\"}}}\0"
json = JSON.parse(str.strip)
name = json["#Network"]["data"]["Name"]
pwd = json["#Network"]["data"]["Pwd"]
Since you don't know regex, let's leave them out for now and try manual parsing which is a bit easier to understand.
Your original input, without the outer apostrophes and name of variable is:
"{\"#Network\":{\"command\":\"Connect\",\"data\":{\"Id\":\"xx:xx:xx:xx:xx:xx\",\"Name\":\"somename\",\"Pwd\":\"123456789\"}}}\0"
You say that you need to get the 'somename' value and that the 'grammar will not change'. Cool!
First, look at what delimits that value: it has quotes, then there's a colon to the left and comma to the right. However, looking at other parts, such layout is also used near the command and near the pwd. So, colon-quote-data-quote-comma is not enough. Looking further to the sides, there's a \"Name\". It never occurs anywhere in the input data except this place. This is just great! That means, that we can quickly find the whereabouts of the data just by searching for the \"Name\" text:
inputdata = .....
estposition = inputdata.index('\"Name\"')
raise "well-known marker was not found in the input" unless estposition
now, we know:
where the part starts
and that after the "Name" text there's always a colon, a quote, and then the-interesting-data
and that there's always a quote after the interesting-data
let's find all of them:
colonquote = inputdata.index(':\"', estposition)
datastart = colonquote+3
lastquote = inputdata.index('\"', datastart)
dataend = lastquote-1
index returns the start position of the match, so it returns the position of the : and the position of the \. Since we want the text between them, we must add/subtract a few positions to move past the :\" at the beginning or to move back from the \" at the end.
Then, fetch the data from between them:
value = inputdata[datastart..dataend]
And that's it.
Now, step back and look at the input data once again. You say that grammar is always the same. The various bits are obviously separated by colons and commas. Let's try using it directly:
parts = inputdata.split(/[:,]/)
=> ["\"{\\\"#Network\\\"",
"{\\\"command\\\"",
"\\\"Connect\\\"",
"\\\"data\\\"",
"\n{\\\"Id\\\"",
"\\\"xx",
"xx",
"xx",
"xx",
"xx",
"xx\\\"",
"\\\"Name\\\"",
"\\\"somename\\\"",
"\\\"Pwd\\\"",
"\\\"123456789\\\"}}}\\0\""]
Please ignore the regex for now. Just assume it says a colon or comma. Now, in parts you will get all the, well, parts, that were detected by cutting the inputdata to pieces at every colon or comma.
If the layout never changes and is always the same, then your interesting-data will be always at place 13th:
almostvalue = parts[12]
=> "\\\"somename\\\""
Now, just strip the spurious characters. Since the grammar is constant, there's 2 chars to be cut from both sides:
value = almostvalue[2..-3]
Ok, another way. Since regex already showed up, let's try with them. We know:
data is prefixed with \"Name\" then colon and backslash-quote
data consists of some text without quotes inside (well, at least I guess so)
data ends with a backslash-quote
the parts in regex syntax would be, respectively:
\"Name\":\"
[^\"]*
\"
together:
inputdata =~ /\\"Name\\":\\"([^\"]*)\\"/
value = $1
Note that I surrounded the interesting part with (), hence after a successful match that part is available in the $1 special variable.
Yet another way:
If you look at the grammar carefully, it really resembles a set of embedded hashes:
\"
{ \"#Network\" :
{ \"command\" : \"Connect\",
\"data\" :
{ \"Id\" : \"xx:xx:xx:xx:xx:xx\",
\"Name\" : \"somename\",
\"Pwd\" : \"123456789\"
}
}
}
\0\"
If we'd write something similar as Ruby hashes:
{ "#Network" =>
{ "command" => "Connect",
"data" =>
{ "Id" => "xx:xx:xx:xx:xx:xx",
"Name" => "somename",
"Pwd" => "123456789"
}
}
}
What's the difference? The colons were replaced with =>, and the backslashes before the quotes are gone. Oh, and also the opening and closing quotes are gone, and that \0 at the end is gone too. Let's play:
tmp = inputdata[1..-4] # remove the opening " and the closing \0"
tmp.gsub!('\"', '"') # replace every \" with just "
Now, what about colons.. We cannot just replace : with =>, because it would damage the internal colons of the xx:xx:xx:xx:xx:xx part.. But, look: all the other colons have always a quote before them!
tmp.gsub!('":', '"=>') # replace every quote-colon with quote-arrow
Now our tmp is:
{"#Network"=>{"command"=>"Connect","data"=>{"Id"=>"xx:xx:xx:xx:xx:xx","Name"=>"somename","Pwd"=>"123456789"}}}
formatted a little:
{ "#Network"=>
{ "command"=>"Connect",
"data"=>
{ "Id"=>"xx:xx:xx:xx:xx:xx","Name"=>"somename","Pwd"=>"123456789" }
}
}
So, it looks just like a Ruby hash. Let's try 'destringizing' it:
packeddata = eval(tmp)
value = packeddata['#Network']['data']['Name']
Done.
Well, this has grown a bit and Jonas was obviously faster, so I'll leave the JSON part to him since he wrote it already ;) The data was so similar to a Ruby hash because it was obviously formatted as JSON, which is a hash-like structure too. Using the proper format-reading tools is usually the best idea, but mind that the JSON library, when asked to read the data, will read all of the data, and only then can you ask it "what was inside at the key xx/yy/zz", just like I showed you with the read-it-as-a-Hash attempt. Sometimes, when your program is very short on time, you cannot afford to read it all. Then, scanning with a regex or scanning manually for "known markers" may (not must) be much faster and thus preferable. But, still, much less convenient. Have fun.
I am working on a new programming language rip, and I'm having trouble getting to the bottom of some infinite loops. Is there a way to print out each rule as it gets called, such that I can see the rules that are recursing? I've tried walking through the code in my head, and I just don't see it. Any help would be much appreciated.
To flesh out Raving Genius’s answer:
The method to patch is actually Parslet::Atoms::Context#lookup. View it on GitHub (permalink to current version). In your own code, you can patch that method to print obj like this:
class Parslet::Atoms::Context
def lookup(obj, pos)
p obj
@cache[pos][obj.object_id]
end
end
Run that code any time before you call parse on your parser, and it will take effect. Sample output:
>> parser = ConsistentNewlineTextParser.new
=> LINES
>> parser.parse("abc")
LINES
(line_content:LINE_CONTENT NEWLINE){0, } line_content:LINE_CONTENT
(line_content:LINE_CONTENT NEWLINE){0, }
line_content:LINE_CONTENT NEWLINE
LINE_CONTENT
WORD
\\w{0, }
\\w
\\w
\\w
\\w
NEWLINE
dynamic { ... }
FIRST_NEWLINE
'? '
'
'?
'
'
'
LINE_CONTENT
=> {:line_content=>"abc"#0}
I figured it out: editing Parslet::Atom::Context#lookup to output the obj parameter will show each rule as it is being called.
My branch of Parslet automatically detects endless loops and exits, reporting the expression that is repeating without consuming anything.
https://github.com/nigelthorne/parslet
See Parse markdown indented code block for an example.
I'm trying to scan through a JavaScript document that has many functions defined throughout it, and to delete a function from the document. I'm hoping that there's a trivial regex trick that I can do here.
Example:
Some JavaScript document:
function wanted_foo(bar) {
...
...
}
function unwanted_foo(bar) {
...
inner_foo {
...
...
}
}
function wanted_foo(bar) {
...
...
inner_foo {
...
...
}
}
The obvious problem here is that I need to match things of the form "function unwanted_foo(bar) { ... }", except that I only need to match up until the last curly brace of the function, and not to the next curly brace of another function. Is there a simple Regex way of doing this?
Regular expressions in the strict sense can't do this, because they can't keep count of the braces matched so far and therefore can't find the last one. Nonetheless, many implementations have capabilities that go beyond regular expressions proper, and Ruby's is one of them.
Here are a couple of references about that, although it might not be what you would call simple.
Matching braces in ruby with a character in front
Backreferences
One "trick" is to use a counter together with regex.
Initialize your counter to 0
Match something of the form /^\s*function\s+\w+\(.*\)\s*\{/ and when you have found this pattern, remember the position.
When you match that pattern, increment your counter
Now match { and } separately and increment or decrement your counter depending on what you've found.
Keep doing this until your counter is 0 again, then you should have found a function
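The counter steps above could look roughly like this in Ruby. The method name remove_function is made up, and the sketch assumes that braces never appear inside strings, regex literals or comments of the JavaScript source, which a real tool would have to handle.

```ruby
# Remove the first definition of the named function from JavaScript source.
def remove_function(source, name)
  header = /^[ \t]*function\s+#{Regexp.escape(name)}\s*\([^)]*\)\s*\{/
  m = header.match(source)
  return source unless m

  depth = 1      # the header already contains the opening brace
  pos = m.end(0) # continue scanning right after it
  while depth > 0
    brace = source.index(/[{}]/, pos)
    raise "unbalanced braces" unless brace

    depth += source[brace] == "{" ? 1 : -1
    pos = brace + 1
  end
  # Keep everything before the header and everything after the last brace,
  # dropping the newline that followed the removed function.
  source[0...m.begin(0)] + source[pos..-1].sub(/\A\n/, "")
end

js = "function keep(a) {\n  x {\n  }\n}\n" \
     "function drop(b) {\n  y {\n  }\n}\n" \
     "function keep2(c) {\n}\n"
cleaned = remove_function(js, "drop")
```

The regex only locates the function header; the nesting is tracked by the counter, which is exactly the part a plain regular expression cannot do.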
Hope that's useful to you?