Ruby parslet: parsing multiple lines

I'm looking for a way to match multiple lines with Parslet.
The code looks like this:
rule(:line) { (match('$').absent? >> any).repeat >> match('$') }
rule(:lines) { line.repeat }
However, lines always ends up in an infinite loop, because match('$') matches the end of the string endlessly without consuming anything.
Is it possible to match multiple lines that can be empty?
irb(main)> lines.parse($stdin.read)
This
is
a
multiline
string^D
should match successfully. Am I missing something? I also tried (match('$').absent? >> any.maybe).repeat(1) >> match('$') but that doesn't match empty lines.
Regards,
Danyel.

I usually define a rule for end_of_line. This is based on the trick in http://kschiess.github.io/parslet/tricks.html for matching end_of_file.
class MyParser < Parslet::Parser
  rule(:cr)        { str("\n") }
  rule(:eol?)      { any.absent? | cr }
  rule(:line_body) { (eol?.absent? >> any).repeat(1) }
  rule(:line)      { cr | line_body >> eol? }
  rule(:lines?)    { line.repeat(0) }
  root(:lines?)
end
puts MyParser.new.parse(" this is a line
so is this
that was too
This ends").inspect
Obviously if you want to do more with the parser than you can achieve with String::split("\n") you will replace the line_body with something useful :)
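For example, here is a small variant of the parser above (my own sketch with a hypothetical class name, not the original answer's code) that tags line_body with .as(:line), so the result is a list of per-line captures instead of one flat match:
require 'parslet'

class CapturingLineParser < Parslet::Parser
  rule(:cr)        { str("\n") }
  rule(:eol?)      { any.absent? | cr }
  rule(:line_body) { (eol?.absent? >> any).repeat(1) }
  # tag each non-empty line body so the parse tree keeps the lines apart
  rule(:line)      { cr | line_body.as(:line) >> eol? }
  rule(:lines?)    { line.repeat(0) }
  root(:lines?)
end

p CapturingLineParser.new.parse("this is a line\n\nthat was blank")
# expected, roughly: [{:line=>"this is a line"}, {:line=>"that was blank"}]
# (the empty line matches cr alone and contributes no capture)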
I had a quick go at answering this question and mucked it up. I just thought I would explain the mistake I made, and show you how to avoid mistakes of that kind.
Here is my first answer.
rule(:eol) { str("\n") | any.absent? }
rule(:line) { (eol.absent? >> any).repeat >> eol }
rule(:lines) { line.as(:line).repeat }
I didn't follow my usual rules:
Always make repeat count explicit
Any rule that can match zero-length strings should have a name ending in '?'
So let's apply these...
rule(:eol?) { str("\n") | any.absent? }
# as the second option consumes nothing
rule(:line?) { (eol?.absent? >> any).repeat(0) >> eol? }
# repeat(0) can consume nothing
rule(:lines?) { line?.as(:line).repeat(0) }
# We have a problem! We have a rule that can consume nothing inside a `repeat`!
Here we see why we get an infinite loop. As the input is consumed, you end up with just the end of file, which matches eol? and hence line? (as the line body can be empty). Inside the repeat of lines?, it keeps matching without consuming anything and loops forever.
We need to change the line rule so it always consumes something.
rule(:cr) { str("\n") }
rule(:eol?) { cr | any.absent? }
rule(:line_body) { (eol?.absent? >> any).repeat(1) }
rule(:line) { cr | line_body >> eol? }
rule(:lines?) { line.as(:line).repeat(0) }
Now line has to match something, either a cr (for empty lines), or at least one character followed by the optional eol?. All repeats have bodies that consume something. We are now golden.

I think you have two related problems with your matching:
The pseudo-character match $ does not consume any real characters. You still need to consume the newlines somehow.
Parslet is munging the input in some way, making $ match in places you might not expect. The best result I could get using $ ended up matching each individual character.
Much safer to use \n as the end-of-line character. I got the following to work (I am somewhat of a beginner with Parslet myself, so apologies if it could be clearer):
require 'parslet'

class Lines < Parslet::Parser
  rule(:text)  { match("[^\n]") }
  rule(:line)  { (text.repeat(0) >> match("\n")) | text.repeat(1) }
  rule(:lines) { line.as(:line).repeat }
  root :lines
end
s = "This
is
a
multiline
string"
p Lines.new.parse( s )
The rule for the line is complex because of the need to match empty lines and a possible final line without a \n.
You don't have to use the .as(:line) syntax - I just added it to show clearly that the :line rule is matching each line individually, and not simply consuming the whole input.
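As a hypothetical follow-up (not part of the answer above), each :line capture is a Parslet::Slice that still carries its trailing newline, so plain strings can be recovered with a little post-processing:
result = Lines.new.parse(s)
p result.map { |node| node[:line].to_s.chomp }
# expected: ["This", "is", "a", "multiline", "string"]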


awk script: removing line previous to pattern match and after, until a blank line

I began learning awk yesterday in attempt to solve this problem (and learn a useful new language). At first I tried using sed, but soon realized it was not the correct tool to access/manipulate lines previous to a pattern match.
I need to:
Remove all lines containing "foo" (trivial on its own, but not whilst keeping track of previous lines)
Find lines containing "bar"
Remove the line previous to the one containing "bar"
Remove all lines after and including the line containing "bar" until we reach a blank line
Example input:
This is foo stuff
I like food!
It is tasty!
stuff
something
stuff
stuff
This is bar
Hello everybody
I'm Dr. Nick
things
things
things
Desired output:
It is tasty!
stuff
something
stuff
things
things
things
My attempt:
{
valid=1; #boolean variable to keep track if x is valid and should be printed
if ($x ~ /foo/){ #x is valid unless it contains foo
valid=0; #invalidate x so that is doesn't get printed at the end
next;
}
if ($0 ~ /bar/){ #if the current line contains bar
valid = 0; #x is invalid (don't print the previous line)
while (NF == 0){ #don't print until we reach an empty line
next;
}
}
if (valid == 1){ #x was a valid line
print x;
}
x=$0; #x is a reference to the previous line
}
Super bonus points (not needed to solve my problem but I'm interested in learning how this would be done):
Ability to remove n lines before pattern match
Option to include/disclude the blank line in output
Below is an alternative awk script using patterns & functions to trigger state changes and manage output, which produces the same result.
function show_last() {
if (!skip && !empty) {
print last
}
last = $0
empty = 0
}
function set_skip_empty(n) {
skip = n
last = $0
empty = NR <= 0
}
BEGIN { set_skip_empty(0) }
END { show_last() ; }
/foo/ { next; }
/bar/ { set_skip_empty(1) ; next }
/^ *$/ { if (skip > 0) { set_skip_empty(0); next } else show_last() }
!/^ *$/{ if (skip > 0) { next } else show_last() }
This works by retaining the "current" line in a variable last, which is either ignored or output, depending on other events, such as the occurrence of foo and bar. The empty variable keeps track of whether the last variable is really a blank line, or simply empty from inception (e.g., BEGIN).
To accomplish the "bonus points", replace last with an array of lines which could then accumulate N number of lines as desired.
To exclude blank lines (such as the one that terminates the bar filter), replace the empty test with a test on the length of the last variable. In awk, empty lines have no length (but, lines with blanks or tabs *do* have a length).
function show_last() {
if (!skip && length(last) > 0) {
print last
}
last = $0
}
will result in no blank lines of output.
Read each blank-lines-separated paragraph in as a string, then do a gsub() removing the strings that match the RE for the pattern(s) you care about:
$ awk -v RS= -v ORS="\n\n" '{ gsub(/[^\n]*foo[^\n]*\n|\n[^\n]*\n[^\n]*bar.*/,"") }1' file
It is tasty!
stuff
something
stuff
things
things
things
To remove N lines, change [^\n]*\n to ([^\n]*\n){N}.
To not remove part of the RE use GNU awk and use gensub() instead of gsub().
To remove the blank lines, change the value of ORS.
Play with it...
This awk should work without storing full file in memory:
awk '/bar/{skip=1;next} skip && p~/^$/ {skip=0} NR>1 && !skip && !(p~/foo/){print p} {p=$0}
END{if (!skip && !(p~/foo/)) print p}' file
It is tasty!
stuff
something
stuff
things
things
things
One way:
awk '
/foo/ { next }
flag && NF { next }
flag && !NF { flag = 0 }
/bar/ { delete line[NR-1]; idx-=1; flag = 1; next }
{ line[++idx] = $0 }
END {
for (x=1; x<=idx; x++) print line[x]
}' file
It is tasty!
stuff
something
stuff
things
things
things
If the line contains foo, skip it.
If the flag is enabled and the line is not blank, skip it.
If the flag is enabled and the line is blank, disable the flag.
If the line contains bar, delete the previous line, decrement the counter, enable the flag and skip the line.
Store every line that makes it through in an array indexed by an incrementing counter.
In the END block, print the stored lines.
Side Notes:
To remove n lines before a pattern match, you can use a loop: starting from the current line number, walk backwards with a for loop and delete entries from your temporary cache (the array), subtracting n from your counter variable.
To include or exclude blank lines you can use the NF variable. For a typical line, NF is the number of fields based on your field separator; for blank lines it is 0. For example, if you change the line above the END block to NF { line[++idx] = $0 }, all blank lines are omitted from the output.

Parslet word until delimiter present

I'm just starting with ruby and parslet, so this might be obvious to others (hopefully).
I want to get all the words up to a delimiter (^) without consuming it.
The following rule works (but consumes the delimiter), with a result of {:wrd=>"otherthings"#0, :delim=>"^"#11}:
require 'parslet'

class Mini < Parslet::Parser
  rule(:word)         { match('[a-zA-Z]').repeat }
  rule(:delimeter)    { str('^') }
  rule(:othercontent) { word.as(:wrd) >> delimeter.as(:delim) }
  root(:othercontent)
end

puts Mini.new.parse("otherthings^")
I was trying to use the 'present?',
require 'parslet'

class Mini < Parslet::Parser
  rule(:word)         { match('[a-zA-Z]').repeat }
  rule(:delimeter)    { str('^') }
  rule(:othercontent) { word.as(:wrd) >> delimeter.present? }
  root(:othercontent)
end

puts Mini.new.parse("otherthings^")
but this throws an exception:
Failed to match sequence (wrd:WORD &DELIMETER) at line 1 char 12. (Parslet::ParseFailed)
At a later stage I'll want to inspect the word to the right of the delimiter to build up a more complex grammar, which is why I don't want to consume the delimiter.
I'm using parslet 1.5.0.
Thanks for your help!
TL;DR:
If you care what is before the "^" you should parse that first.
--- longer answer ---
A parser always consumes all the text; if it can't consume everything, then the document is not fully described by the grammar. Rather than thinking of it as something performing "splits" on your text, think of it as a clever state machine consuming a stream of text.
So, as your full grammar needs to consume the whole document, you can't make it parse some part and leave the rest. You want it to transform your document into a tree so you can manipulate it into its final form.
If you really wanted to just consume all text before a delimiter, then you could do something like this...
Say I was going to parse a '^' separated list of things.
I could have the following rules
rule(:thing) { (str("^").absent? >> any).repeat(1) } # anything that's not a ^
rule(:list) { thing >> ( str("^") >> thing).repeat(0) } #^ separated list of things
This would work as follows
parse("thing1^thing2") #=> "thing1^thing2"
parse("thing1") #=> "thing1"
parse("thing1^") #=> ERROR ... nothing after the ^ there should be a 'thing'
This means list will match a string that doesn't start or end with a '^'. To be useful, however, I need to pull out the bits that are the values, using the "as" keyword:
rule(:thing) { (str("^").absent? >> any).repeat(1).as(:thing) }
rule(:list) { thing >> ( str("^") >> thing).repeat(0) }
Now when list matches a string I get an array of hashes of "things".
parse("thing1^thing2") #=> [ {:thing=>"thing1"#0} , {:thing=>"thing2"#7} ]
In reality however you probably care what a 'thing' is... not just anything will go there.
In that case.. you should start by defining those rules... because you don't want to use the parser to split by "^" then re-parse the strings to work out what they are made of.
For example:
parse("6 + 4 ^ 2")
# => [ {:thing=>"6 + 4 "#0}, {:thing=>" 2"#7} ]
And I probably want to ignore the whitespace around the "thing"s, and I probably want to deal with the 6, the + and the 4 all separately. When I do that I am going to have to throw away my "all things that aren't '^'" rule.
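To make that concrete, here is a hedged sketch of where this leads (my own example with invented rule names, not the answer's code): describe what a "thing" actually is, in this case a tiny arithmetic expression with optional spaces, instead of "anything that isn't '^'".
require 'parslet'

class ExprList < Parslet::Parser
  rule(:space?)   { str(' ').repeat }
  rule(:integer)  { match('[0-9]').repeat(1) }
  rule(:operator) { match('[-+*]') }
  # a "thing" is now a real (if tiny) expression rather than "not a '^'"
  rule(:sum)      { integer >> (space? >> operator >> space? >> integer).repeat }
  rule(:thing)    { space? >> sum.as(:thing) >> space? }
  rule(:list)     { thing >> (str('^') >> thing).repeat }
  root(:list)
end

p ExprList.new.parse("6 + 4 ^ 2")
# expected, roughly: [{:thing=>"6 + 4"}, {:thing=>"2"}]
The surrounding spaces are now consumed by the grammar itself, so each :thing capture is a trimmed expression rather than an arbitrary chunk of text.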

Is backreference available in Parslet?

Is there a way to backreference a previously matched string in Parslet, similar to the \1 functionality in typical regular expressions?
I want to extract the characters within a block such as:
Marker SomeName
some random text, numbers123
and symbols !#%
SomeName
in which "Marker" is a known string but "SomeName" is not known a-priori, so I believe I need something like:
rule(:name) { ( match('\w') >> match('\w\d') ).repeat(1) }
rule(:text_within_the_block) {
str('Marker') >> name >> any.repeat.as(:text_block) >> backreference_to_name
}
What I don't know is how to write the backreference_to_name rule using Parslet and/or Ruby language.
From http://kschiess.github.io/parslet/parser.html
Capturing input
Sometimes a parser needs to match against something that was already
matched against. Think about Ruby heredocs for example:
str = <<-HERE
This is part of the heredoc.
HERE
The key to matching this kind of document is to capture part of the
input first and then construct the rest of the parser based on the
captured part. This is what it looks like in its simplest form:
match['ab'].capture(:capt) >> # create the capture
dynamic { |s,c| str(c.captures[:capt]) } # and match using the capture
The key here is that the dynamic block returns a lazy parser: it's only evaluated at the point it's being used, and it gets passed its current context to reference at the point of execution.
-- Updated : To add a worked example --
So for your example:
require 'parslet'
require 'parslet/convenience'

class Mini < Parslet::Parser
  rule(:name) { match("[a-zA-Z]") >> match('\\w').repeat }
  rule(:text_within_the_block) {
    str('Marker ') >>
    name.capture(:namez).as(:name) >>
    str(" ") >>
    dynamic { |_, scope|
      (str(scope.captures[:namez]).absent? >> any).repeat
    }.as(:text_block) >>
    dynamic { |src, scope| str(scope.captures[:namez]) }
  }
  root(:text_within_the_block)
end

puts Mini.new.parse_with_debug("Marker BOB some text BOB").inspect
#=> {:name=>"BOB"#7, :text_block=>"some text "#11}
This required a couple of changes.
I changed rule(:name) to match a single word and added a str(" ") to detect that the word had ended. (Note: \w is short for [A-Za-z0-9_], so it includes digits.)
I changed the "any" match to be conditional on the text not being the :name text (otherwise it consumes the 'BOB' and then fails to match, i.e. it's greedy!).
I don't exactly want to support stackoverflow, but as you seem to be a parslet user, here goes: Try asking on the mailing list for a real nice answer. (http://dir.gmane.org/gmane.comp.lang.ruby.parslet)
What you call a back-reference here is called a 'capture' in parslet. Please see the example 'capture.rb' in parslet's source tree.

How do I handle C-style comments in Ruby using Parslet?

Taking as a starting point the code example from Parslet's own creator (available in this link), I need to extend it so as to retrieve all the non-commented text from a file written in a C-like syntax.
The provided example is able to successfully parse C-style comments, treating those regions as if they were ordinary whitespace. However, this simple example only expects 'a' characters in the non-commented areas of the file, such as the input example:
a
// line comment
a a a // line comment
a /* inline comment */ a
/* multiline
comment */
The rule used to detect the non-commented text is simply:
rule(:expression) { (str('a').as(:a) >> spaces).as(:exp) }
Therefore, what I need is to generalize the previous rule to get all the other (non-commented) text from a more generic file such as:
word0
// line comment
word1 // line comment
phrase /* inline comment */ something
/* multiline
comment */
I am new to Parsing Expression Grammars and neither of my previous trials succeeded.
The general idea is that everything is code (aka non-comment) until one of the sequences // or /* appears. You can reflect this with a rule like this:
rule(:code) {
(str('/*').absent? >> str('//').absent? >> any).repeat(1).as(:code)
}
As mentioned in my comment, there is a small problem with strings, though. When a comment occurs inside a string, it obviously is part of the string. If you were to remove comments from your code, you would then alter the meaning of the code. Therefore, we have to let the parser know what a string is, and that any character inside it belongs to the string.
Another issue is escape sequences. For example the string "foo \" bar /*baz*/", which contains a literal double quote, would actually be parsed as "foo \", followed by some code again. This is of course something that needs to be addressed. I have written a complete parser that handles all of the above cases:
require 'parslet'

class CommentParser < Parslet::Parser
  rule(:eof) {
    any.absent?
  }
  rule(:block_comment_text) {
    (str('*/').absent? >> any).repeat.as(:comment)
  }
  rule(:block_comment) {
    str('/*') >> block_comment_text >> str('*/')
  }
  rule(:line_comment_text) {
    (str("\n").absent? >> any).repeat.as(:comment)
  }
  rule(:line_comment) {
    str('//') >> line_comment_text >> (str("\n").present? | eof)
  }
  rule(:string_text) {
    (str('"').absent? >> str('\\').maybe >> any).repeat
  }
  rule(:string) {
    str('"') >> string_text >> str('"')
  }
  rule(:code_without_strings) {
    (str('"').absent? >> str('/*').absent? >> str('//').absent? >> any).repeat(1)
  }
  rule(:code) {
    (code_without_strings | string).repeat(1).as(:code)
  }
  rule(:code_with_comments) {
    (code | block_comment | line_comment).repeat
  }
  root(:code_with_comments)
end
It will parse your input
word0
// line comment
word1 // line comment
phrase /* inline comment */ something
/* multiline
comment */
to this AST
[{:code=>"\n word0\n "#0},
{:comment=>" line comment"#13},
{:code=>"\n word1 "#26},
{:comment=>" line comment"#37},
{:code=>"\n phrase "#50},
{:comment=>" inline comment "#61},
{:code=>" something \n "#79},
{:comment=>" multiline\n comment "#94},
{:code=>"\n"#116}]
To extract everything except the comments you can do:
input = <<-CODE
word0
// line comment
word1 // line comment
phrase /* inline comment */ something
/* multiline
comment */
CODE
ast = CommentParser.new.parse(input)
puts ast.map{|node| node[:code] }.join
which will produce
word0
word1
phrase something
Another way to handle comments is to consider them white space. For example:
rule(:space?) do
  space.maybe
end

rule(:space) do
  (block_comment | line_comment | whitespace).repeat(1)
end

rule(:whitespace) do
  match('\s')
end

rule(:block_comment) do
  # use `any` here so the comment body may span several lines
  str('/*') >>
    (str('*/').absent? >> any).repeat(0) >>
    str('*/')
end

rule(:line_comment) do
  str('//') >> match('[^\n]').repeat >> str("\n")
end
Then, when you are writing rules with white-space, such as this entirely off-the-cuff and probably wrong rule for C,
rule(:assignment_statement) do
lvalue >> space? >> str('=') >> space? >> rvalue >> str(';')
end
comments get "eaten" by the parser without any fuss. Anywhere white space can or must appear, comments of any kind are allowed, and are treated as white space.
This approach is not as suitable for your exact problem, which is to recognize non-comment text in a C program, but it works very well in a parser which must recognize the full language.
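To make the idea concrete, here is a minimal self-contained sketch (my own illustration: the class name and the :identifier, :assignment and :program rules are invented for the example) that reuses the space rules above, so comments vanish into ordinary whitespace handling:
require 'parslet'

class TinyLang < Parslet::Parser
  rule(:space?)        { space.maybe }
  rule(:space)         { (block_comment | line_comment | whitespace).repeat(1) }
  rule(:whitespace)    { match('\s') }
  rule(:block_comment) { str('/*') >> (str('*/').absent? >> any).repeat(0) >> str('*/') }
  rule(:line_comment)  { str('//') >> match('[^\n]').repeat >> str("\n") }

  rule(:identifier)    { match('[a-zA-Z_]').repeat(1) }
  rule(:assignment)    { identifier.as(:lhs) >> space? >> str('=') >> space? >>
                         identifier.as(:rhs) >> space? >> str(';') }
  rule(:program)       { space? >> (assignment.as(:assign) >> space?).repeat }
  root(:program)
end

p TinyLang.new.parse("a = b; /* ignored */ c = d; // also ignored\n")
# expected, roughly:
# [{:assign=>{:lhs=>"a", :rhs=>"b"}}, {:assign=>{:lhs=>"c", :rhs=>"d"}}]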

Parslet: exclusion clause

I am currently writing a Ruby parser using Ruby, and more precisely Parslet, since I think it is far easier to use than Treetop or Citrus. I create my rules using the official specifications, but there are some statements I cannot write, since they "exclude" some syntax, and I do not know how to do that... Well, here is an example for you to understand...
Here is a basic rule :
foo::=
any-character+ BUT NOT (foo* escape_character barbar*)
# Knowing that (foo* escape_character barbar*) is included in any-character
How could I translate that using Parslet? Maybe with the absent?/present? stuff?
Thank you very much, hope someone has an idea....
Have a nice day!
EDIT:
I tried what you said, so here's my translation into Ruby using Parslet:
rule(:line_comment){(source_character.repeat >> line_terminator >> source_character.repeat).absent? >> source_character.repeat(1)}
However, it does not seem to work (the sequence in parentheses). I did some tests, and came to the conclusion that what's written in my parens is wrong.
Here is a much simpler example; let's consider these rules:
# Parslet rules
rule(:source_character) {any}
rule(:line_terminator){ str("\n") >> str("\r").maybe }
rule(:not){source_character.repeat >> line_terminator }
# Which looks like what I try to "detect" up there
I tested these rules with this code:
# Code to test :
code = "test
"
But I get that:
Failed to match sequence (SOURCE_CHARACTER{0, } LINE_TERMINATOR) at line 2 char 1.
`- Failed to match sequence (SOURCE_CHARACTER{0, } LINE_TERMINATOR) at line 2 char 1.
   `- Failed to match sequence ('\n' '\r'?) at line 2 char 1.
      `- Premature end of input at line 2 char 1.
nil
If this sequence doesn't work, my 'complete' rule up there won't ever work... If anyone has an idea, it would be great.
Thank you !
You can do something like this:
rule(:word) { match['^")(\\s'].repeat(1) } # normal word
rule(:op) { str('AND') | str('OR') | str('NOT') }
rule(:keyword) { str('all:') | str('any:') }
rule(:searchterm) { keyword.absent? >> op.absent? >> word }
In this case, absent? does a lookahead to make sure the next token is not a keyword; if not, it then checks that it's not an operator; and if not, it finally checks whether it's a valid word.
An equivalent rule would be:
rule(:searchterm) { (keyword | op).absent? >> word }
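Here is a hedged, runnable sketch that wraps those rules into a complete parser; the class name, the :space and :query rules and the sample input are my additions for illustration:
require 'parslet'

class SearchParser < Parslet::Parser
  rule(:space)      { match('\s').repeat(1) }
  rule(:word)       { match['^")(\\s'].repeat(1) }
  rule(:op)         { str('AND') | str('OR') | str('NOT') }
  rule(:keyword)    { str('all:') | str('any:') }
  rule(:searchterm) { (keyword | op).absent? >> word }
  # a query is a space-separated mix of operators and search terms
  rule(:query)      { searchterm.as(:term) >>
                      (space >> (op.as(:op) | searchterm.as(:term))).repeat }
  root(:query)
end

p SearchParser.new.parse("foo AND bar")
# expected, roughly: [{:term=>"foo"}, {:op=>"AND"}, {:term=>"bar"}]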
Parslet matching is greedy by nature. This means that when you repeat something like
foo.repeat
parslet will match foo until it fails. If foo is
rule(:foo) { any }
you will be on the path to fail, since any.repeat always matches the entire rest of the document!
What you're looking for is something like the string matcher in examples/string_parser.rb (parslet source tree):
rule :string do
  str('"') >>
  (
    (str('\\') >> any) |
    (str('"').absent? >> any)
  ).repeat.as(:string) >>
  str('"')
end
What this says is: 'match ", then match either a backslash followed by any character at all, or match any other character, as long as it is not the terminating ".'
So .absent? is really a way to exclude things from a match that follows:
str('foo').absent? >> (str('foo') | str('bar'))
will only match 'bar'. If you understand that, I assume you will be able to resolve your difficulties. Although those will not be the last on your way to a Ruby parser...
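As a hedged sketch of how that applies to the line_comment rule from the question (the '#' prefix and the end-of-input alternative are my assumptions about the intended grammar), the repeated body becomes "any source_character that does not start a line_terminator", so the repeat can never swallow the newline:
require 'parslet'

class LineCommentParser < Parslet::Parser
  rule(:source_character) { any }
  rule(:line_terminator)  { str("\n") >> str("\r").maybe }
  rule(:line_comment) {
    str('#') >>
    (line_terminator.absent? >> source_character).repeat >>
    (line_terminator | any.absent?)
  }
  root(:line_comment)
end

p LineCommentParser.new.parse("# a comment\n")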
