Taking the code example from Parslet's own creator (available in this link) as a starting point, I need to extend it to retrieve all the non-commented text from a file written in a C-like syntax.
The provided example successfully parses C-style comments, treating them as ordinary white space. However, this simple example only expects 'a' characters in the non-commented areas of the file, as in this example input:
a
// line comment
a a a // line comment
a /* inline comment */ a
/* multiline
comment */
The rule used to detect the non-commented text is simply:
rule(:expression) { (str('a').as(:a) >> spaces).as(:exp) }
Therefore, what I need is to generalize the previous rule to get all the other (non-commented) text from a more generic file such as:
word0
// line comment
word1 // line comment
phrase /* inline comment */ something
/* multiline
comment */
I am new to Parsing Expression Grammars and none of my previous attempts succeeded.
The general idea is that everything is code (aka non-comment) until one of the sequences // or /* appears. You can reflect this with a rule like this:
rule(:code) {
(str('/*').absent? >> str('//').absent? >> any).repeat(1).as(:code)
}
As mentioned in my comment, there is a small problem with strings, though. When a comment sequence occurs inside a string, it is obviously part of the string. If you removed such "comments" from your code, you would alter the meaning of the code. Therefore, we have to let the parser know what a string is, and that any character inside it belongs to it. Escape sequences are another issue. For example, the string "foo \" bar /*baz*/", which contains a literal double quote, would be parsed by a naive rule as "foo \", followed by some code again. This of course needs to be addressed. I have written a complete parser that handles all of the above cases:
require 'parslet'
class CommentParser < Parslet::Parser
rule(:eof) {
any.absent?
}
rule(:block_comment_text) {
(str('*/').absent? >> any).repeat.as(:comment)
}
rule(:block_comment) {
str('/*') >> block_comment_text >> str('*/')
}
rule(:line_comment_text) {
(str("\n").absent? >> any).repeat.as(:comment)
}
rule(:line_comment) {
str('//') >> line_comment_text >> (str("\n").present? | eof)
}
rule(:string_text) {
(str('"').absent? >> str('\\').maybe >> any).repeat
}
rule(:string) {
str('"') >> string_text >> str('"')
}
rule(:code_without_strings) {
(str('"').absent? >> str('/*').absent? >> str('//').absent? >> any).repeat(1)
}
rule(:code) {
(code_without_strings | string).repeat(1).as(:code)
}
rule(:code_with_comments) {
(code | block_comment | line_comment).repeat
}
root(:code_with_comments)
end
It will parse your input
word0
// line comment
word1 // line comment
phrase /* inline comment */ something
/* multiline
comment */
to this AST
[{:code=>"\n word0\n "#0},
{:comment=>" line comment"#13},
{:code=>"\n word1 "#26},
{:comment=>" line comment"#37},
{:code=>"\n phrase "#50},
{:comment=>" inline comment "#61},
{:code=>" something \n "#79},
{:comment=>" multiline\n comment "#94},
{:code=>"\n"#116}]
To extract everything except the comments you can do:
input = <<-CODE
word0
// line comment
word1 // line comment
phrase /* inline comment */ something
/* multiline
comment */
CODE
ast = CommentParser.new.parse(input)
puts ast.map{|node| node[:code] }.join
which will produce
word0
word1
phrase something
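For comparison, the same idea (everything is code until //, /* or a string starts) can be sketched in plain Ruby with StringScanner from the standard library, without Parslet. This is only a rough sketch: strip_comments is a made-up helper name, and it assumes double-quoted strings with backslash escapes are the only literals that need protecting.

```ruby
require 'strscan'

# Hypothetical helper (not part of Parslet): returns the input with
# // and /* */ comments removed, leaving string literals untouched.
def strip_comments(source)
  scanner = StringScanner.new(source)
  code = +''
  until scanner.eos?
    if scanner.scan(/"(?:\\.|[^"\\])*"/m)   # string literal, escapes included
      code << scanner.matched
    elsif scanner.scan(%r{/\*.*?\*/}m)       # block comment: drop it
      next
    elsif scanner.scan(%r{//[^\n]*})         # line comment: drop it, keep the newline
      next
    else
      code << scanner.getch                  # any other character is code
    end
  end
  code
end

puts strip_comments(%q{a = "foo \" bar /*baz*/"; // trailing})
```

The string branch comes first on purpose, so that comment markers inside a string are never seen by the comment branches.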
Another way to handle comments is to consider them white space. For example:
rule(:space?) do
space.maybe
end
rule(:space) do
(block_comment | line_comment | whitespace).repeat(1)
end
rule(:whitespace) do
match('\s')
end
rule(:block_comment) do
str('/*') >>
(str('*/').absent? >> any).repeat(0) >>
str('*/')
end
rule(:line_comment) do
str('//') >> match('[^\n]').repeat(0) >> (str("\n") | any.absent?)
end
Then, when you are writing rules with white-space, such as this entirely off-the-cuff and probably wrong rule for C,
rule(:assignment_statement) do
lvalue >> space? >> str('=') >> space? >> rvalue >> str(';')
end
comments get "eaten" by the parser without any fuss. Anywhere white space can or must appear, comments of any kind are allowed, and are treated as white space.
This approach is not as suitable for your exact problem, which is to recognize non-comment text in a C program, but it works very well in a parser which must recognize the full language.
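To illustrate the "comments are white space" idea outside Parslet, here is a small plain-Ruby tokenizer sketch using StringScanner. The names SPACE and tokens are invented for this example, and the token rule (a word, or else a single character) is a deliberate oversimplification.

```ruby
require 'strscan'

# White space, block comments and line comments all count as "space".
SPACE = %r{(?:\s+|/\*.*?\*/|//[^\n]*)+}m

# Hypothetical helper: split the source into crude tokens,
# silently skipping anything matched by SPACE.
def tokens(source)
  scanner = StringScanner.new(source)
  result = []
  until scanner.eos?
    next if scanner.skip(SPACE)                        # comments vanish here
    result << (scanner.scan(/\w+/) || scanner.getch)   # a word, or one symbol
  end
  result
end

p tokens("x /* c */ = 1; // done\n")
```

The rest of the tokenizer never has to mention comments at all, which is the point of folding them into the white-space rule.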
I wrote a working sed script which replaces multiple spaces between tokens with a single space (it skips lines containing # or //):
#!/bin/sed -f
/.*#/ !{
/\/\//n
# handle more than one space between tokens
s/\([^ ]\)\s\+/\1 /g
}
I run it on Ubuntu like this: ./spaces.sed < spa.txt
spa.txt:
/** spa.txt text
date : some date
hih+jjhh jgjg
if ( hjh>=hjhjh )
y **/
# this is a comment
// this is a comment
lines begins here ;
/****** this line is comment ****/
some more lines
// again comment
more lines words
/** again multi line co
mmment it
comment line
follows till here**/
file ends
Now I want the script to also skip over the lines between a pair of patterns (the region can span multiple lines). The patterns are: /* and */
I tried many things, but to no avail:
#!/bin/sed -f
/.*#/ !{
/\/\*/,/\*\// {
/\/\*/n #it skips successfully the /* line
n #also skips next line
/\*\// !{
}
}
/\/\//n
# handle more than one space between tokens
s/\([^ ]\)\s\+/\1 /g
}
but the script isn't working as expected.
Expected output:
/** spa.txt text
date : some date
hih+jjhh jgjg
if ( hjh>=hjhjh )
y **/
# this is a comment
// this is a comment
lines begins here ;
/****** this line is comment ****/
some more lines
// again comment
more lines words
/** again multi line co
mmment it
comment line
follows till here**/
file ends
suggestions?
Thanks
I'd re-engineer the script a bit, to handle # and // comments on their own. With the /* … */ comments, you have to deal with single-line and multi-line variants separately. I'd also use the [[:space:]] notation to spot spaces or tabs. I prefer to avoid backslashes (an aversion caused by working with troff in the days of my youth: if you've never needed 16 backslashes in a row to get the desired effect, you've not suffered enough), so I use \%…% to choose the % character as the search marker instead of / (which means there's no need to escape the slashes in the pattern with a backslash), and I use [*] instead of \*. The { p; d; } notation prints the current line and then deletes it and moves on to the next line. (Using n prints the current line and reads the next one into the pattern space, but then carries on with the rest of the script instead of starting again from the top; it isn't what you need.) The second semicolon isn't required by GNU sed but is by BSD (macOS) sed. The spaces in those braces are optional but make it easier to read.
Putting this together, you might have spaces.sed like this:
#!/bin/sed -f
# Comments with a #
/#/ { p; d; }
# Comments with //
\%//% { p; d; }
# Single line /* ... */ comments
\%/[*].*[*]/% { p; d; }
# Multi-line /* ... */ comments
\%/[*]%,\%[*]/% { p; d; }
s/\([^[:space:]]\)[[:space:]]\{2,\}/\1 /g
On your sample data (thanks for including it!), this produces:
/** spa.txt text
date : some date
hih+jjhh jgjg
if ( hjh>=hjhjh )
y **/
# this is a comment
// this is a comment
lines begins here ;
/****** this line is comment ****/
some more lines
// again comment
more lines words
/** again multi line co
mmment it
comment line
follows till here**/
file ends
That looks like what you wanted.
Limitations
It doesn't remove multiple spaces at the start of a line.
the leading blanks are not removed.
If you have a line with multiple spaces and // or #, the multiple spaces remain:
these spaces // survive
so do # these
If you have multiple single line comments on a single line, you don't get spaces removed in between them:
/* these */ spaces are not /* removed */
If you have a single-line comment and the start of a multi-line comment on a single line, the multi-line comment is not spotted. Similarly, if you have a multi-line comment that ends on a line and has a single-line comment starting after it, then if there are any multiple spaces between the end of the one comment and the start of the next, they are not handled.
/* this */ is not /* handled
very well */ nor are these /* spaces */
This doesn't deal with the subtleties of backslash-newline in the middle of a start or end comment symbol, nor with backslash-newline at the end of a // comment. Only brain-dead programs (or programmers) produce such comments, so it shouldn't be a real problem. Fortunately, you're not writing a compiler; those have to deal with the nonsense. And don't get me started on trigraphs!
It doesn't handle comment-like sequences inside strings (or multi-character character constants):
"/* this is not a comment */"
'/*', ' ', '*/'
However, most of these issues are subtle enough that you're probably OK without dealing with them. If you must deal with them, then you need a program, not a sed script (assuming you value your sanity).
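If you do end up needing a program, a minimal Ruby sketch along these lines might be a starting point. Treat it as an illustration rather than a finished tool: squeeze_outside_comments is a made-up name, it treats # as a comment starter because your sample data does, it ignores backslash-newline and trigraphs entirely, and unlike the sed script it also squeezes leading blanks down to one space.

```ruby
require 'strscan'

# Hypothetical helper: squeeze runs of blanks to a single space, but only
# in text that lies outside comments and string/character literals.
def squeeze_outside_comments(source)
  scanner = StringScanner.new(source)
  out = +''
  until scanner.eos?
    if scanner.scan(%r{/\*.*?\*/}m) ||          # /* ... */ comment, may span lines
       scanner.scan(%r{//[^\n]*}) ||             # // comment
       scanner.scan(/#[^\n]*/) ||                # # comment (as in the sample file)
       scanner.scan(/"(?:\\.|[^"\\\n])*"/) ||    # string literal
       scanner.scan(/'(?:\\.|[^'\\\n])*'/)       # character constant
      out << scanner.matched                     # copied verbatim
    elsif scanner.scan(/[ \t]{2,}/)
      out << ' '                                 # two or more blanks become one
    else
      out << scanner.getch
    end
  end
  out
end

print squeeze_outside_comments(File.read(ARGV[0])) if ARGV[0]
```

Because it works on the whole input rather than line by line, several comments on one line, or a comment that starts right after another one ends, need no special handling.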
I have the following task to do, but I have no idea how to write it with sed.
I have to change the style of comments in a file from:
//something6
//something4
//something5
//something3
//something2
to
/*something6
* something4
* something5
* something3
* something2*/
from
//something6
//something4
//something5
//something3
//something2
to
/*something6
something4
something5
something3
something2*/
from
/*something6
* something4
* something5
* something3
* something2*/
to
//something6
//something4
//something5
//something3
//something2
from
/*something6
something4
something5
something3
something2*/
to
//something6
//something4
//something5
//something3
//something2
Those 4 conversions must be done with sed (I guess, but I am not sure about that).
I tried doing it, but without luck. I can replace single words with other ones, but how do I change the style of commenting? No clue. I would be very grateful for help and assistance.
Given that the task is:
Please write a script that allows to change style of comments in source files for example : /* .... */ goes to // .... The style of comment is an argument of the script.
I have tried to use just typical:
sed -i 's/'"$lookingfor"'/'"$changing"'/g' $filename
In this context, either $lookingfor or $changing or both will contain slashes, so that simple formulation doesn't work, as you correctly observe.
The conversion of // comments to /* comments is easy as long as you know that you can choose an arbitrary character to separate the sections of the s/// command, such as %. So, for example, you could use:
sed -i.bak -e 's%// *\(.*\)%/*\1 */%'
This looks for a double-slash followed by zero or more spaces and anything and converts it to /* anything */.
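In case sed is not a hard requirement, the same substitution can be sketched in Ruby. The helper name line_to_block is invented here, and like the sed command it makes no attempt to recognise // inside strings.

```ruby
# Hypothetical helper mirroring s%// *\(.*\)%/*\1 */% on every line.
def line_to_block(source)
  # "." does not match a newline, so each match stops at the end of its line.
  source.gsub(%r{// *(.*)}) { "/*#{Regexp.last_match(1)} */" }
end

puts line_to_block("//something6\nx = 1; //  set x")
```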
The conversion of /* comments is much harder. There are two cases to be concerned about:
/* A single line comment */
/*
** A multiline comment
*/
That's before you get into:
/* OK */ "/* OK */" /* Really?! */
which is a single line containing two comments and a string containing text that outside a string would look like a comment. This I am studiously ignoring! Or, more accurately, I am studiously deciding that it will be OK when converted to:
// OK */ "/* OK */" /* Really?!
which isn't the same at all, but serves you right for writing convoluted C in the first place.
You can deal with the first case with something like:
sed -e '\%/\*\(.*\)\*/% { s%%//\1%; n; }'
I have the grouping braces and the n command in there so that single line comments don't also match the second case:
-e '\%/\*%,\%\*/% {
\%/\*% { s%/\*\(.*\)%//\1%; n; }
\%\*/% { s%\(.*\)\*/%//\1%; n; }
s%^\( *\)%\1//%
}'
The first line selects a range of lines between one matching /* and the next matching */. The \% tells sed to use the % instead of / as the search delimiter. There are three operations within the outer grouping { … }:
Convert /*anything into //anything and start on the next line.
Convert anything*/ into //anything and start on the next line.
Convert any other line so that it preserves leading blanks but puts // after them.
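For comparison, roughly the same three operations can be sketched in Ruby with a single gsub. The name block_to_line_comments is made up for this example, and one assumption differs from the sed version: // goes at the very start of each continuation line rather than after its leading blanks.

```ruby
# Hypothetical helper: rewrite every /* ... */ comment as // comments.
def block_to_line_comments(source)
  source.gsub(%r{/\*(.*?)\*/}m) do
    # Put // where the /* was, and again after every newline inside the comment.
    '//' + Regexp.last_match(1).gsub("\n", "\n//")
  end
end

puts block_to_line_comments("/*\n** A multiline comment\n*/")
```

It is just as easy to subvert as the sed script: code following a comment on the same line ends up inside the // comment.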
This is still ridiculously easy to subvert if the comments are maliciously formed. For example:
/* a comment */ int x = 0;
is mapped to:
// a comment int x = 0;
Fixing problems like that, and the example with a string, is something I'd not even start trying in sed. And that's before you get onto the legal but implausible C comments, like:
/\
\
* comment
*\
\
/
/\
/\
noisiness \
commentary \
continued
Which contains just two comments (but does contain two comments!). And before you decide to deal with trigraphs (??/ is a backslash). Etc.
So, a moderate approximation to a C to C++ comment conversion is:
sed -e '\%/\*\(.*\)\*/% { s%%//\1%; n; }' \
-e '\%/\*%,\%\*/% {
\%/\*% { s%/\*\(.*\)%//\1%; n; }
\%\*/% { s%\(.*\)\*/%//\1%; n; }
s%^\( *\)%\1//%
}' \
-i.bak "$@"
I'm assuming you aren't using a C shell; if you are, you need more backslashes at the ends of the lines in the script so that the multi-line single-quoted sed command is treated correctly.
Is there a way to backreference a previous string in parslet similarly to the \1 functionality in typical regular expressions ?
I want to extract the characters within a block such as:
Marker SomeName
some random text, numbers123
and symbols !#%
SomeName
in which "Marker" is a known string but "SomeName" is not known a-priori, so I believe I need something like:
rule(:name) { ( match('\w') >> match('\w\d') ).repeat(1) }
rule(:text_within_the_block) {
str('Marker') >> name >> any.repeat.as(:text_block) >> backreference_to_name
}
What I don't know is how to write the backreference_to_name rule using Parslet and/or Ruby language.
From http://kschiess.github.io/parslet/parser.html
Capturing input
Sometimes a parser needs to match against something that was already
matched against. Think about Ruby heredocs for example:
str = <<-HERE
This is part of the heredoc.
HERE
The key to matching this kind of document is to capture part of the
input first and then construct the rest of the parser based on the
captured part. This is what it looks like in its simplest form:
match['ab'].capture(:capt) >> # create the capture
dynamic { |s,c| str(c.captures[:capt]) } # and match using the capture
The key here is that the dynamic block returns a lazy parser. It is only evaluated at the point where it is used, and it is passed its current context to reference at the point of execution.
-- Updated : To add a worked example --
So for your example:
require 'parslet'
require 'parslet/convenience'
class Mini < Parslet::Parser
rule(:name) { match("[a-zA-Z]") >> match('\\w').repeat }
rule(:text_within_the_block) {
str('Marker ') >>
name.capture(:namez).as(:name) >>
str(" ") >>
dynamic { |_,scope|
(str(scope.captures[:namez]).absent? >> any).repeat
}.as(:text_block) >>
dynamic { |src,scope| str(scope.captures[:namez]) }
}
root(:text_within_the_block)
end
puts Mini.new.parse_with_debug("Marker BOB some text BOB").inspect
#=> {:name=>"BOB"#7, :text_block=>"some text "#11}
This required a couple of changes.
I changed rule(:name) to match a single word and added a str(" ") to detect that word had ended. (Note: \w is short for [A-Za-z0-9_] so it includes digits)
I changed the "any" match to be conditional on the text not being the :name text. (otherwise it consumes the 'BOB' and then fails to match, ie. it's greedy!)
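Since the question compares this to \1 in regular expressions, here is the plain Ruby Regexp equivalent of the same block, for comparison only (it is not Parslet). BLOCK and extract_block are names invented for this sketch, and like the parser above it assumes a single space after the name.

```ruby
# \k<name> refers back to whatever the named group (?<name>...) matched.
# The lazy .*? plays the role of the absent? guard: it stops at the first
# repetition of the captured name instead of swallowing it.
BLOCK = /Marker (?<name>[A-Za-z]\w*) (?<text>.*?)\k<name>/m

# Hypothetical helper: returns the captured name and text, or nil on no match.
def extract_block(input)
  m = BLOCK.match(input)
  m && { name: m[:name], text_block: m[:text] }
end

p extract_block("Marker BOB some text BOB")
```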
I don't exactly want to support stackoverflow, but as you seem to be a parslet user, here goes: Try asking on the mailing list for a real nice answer. (http://dir.gmane.org/gmane.comp.lang.ruby.parslet)
What you call a back-reference here is called a 'capture' in Parslet. Please see the example 'capture.rb' in Parslet's source tree.
I'm looking for a way to match multiple lines with Parslet.
The code looks like this:
rule(:line) { (match('$').absent? >> any).repeat >> match('$') }
rule(:lines) { line.repeat }
However, lines will always end up in an infinite loop which is because match('$') will endlessly repeat to match end of string.
Is it possible to match multiple lines that can be empty?
irb(main)> lines.parse($stdin.read)
This
is
a
multiline
string^D
should match successfully. Am I missing something? I also tried (match('$').absent? >> any.maybe).repeat(1) >> match('$') but that doesn't match empty lines.
Regards,
Danyel.
I usually define a rule for end_of_line. This is based on the trick in http://kschiess.github.io/parslet/tricks.html for matching end_of_file.
class MyParser < Parslet::Parser
rule(:cr) { str("\n") }
rule(:eol?) { any.absent? | cr }
rule(:line_body) { (eol?.absent? >> any).repeat(1) }
rule(:line) { cr | line_body >> eol? }
rule(:lines?) { line.repeat(0) }
root(:lines?)
end
puts MyParser.new.parse(""" this is a line
so is this
that was too
This ends""").inspect
Obviously if you want to do more with the parser than you can achieve with String::split("\n") you will replace the line_body with something useful :)
I had a quick go at answering this question and mucked it up. I just thought I would explain the mistake I made, and show you how to avoid mistakes of that kind.
Here is my first answer.
rule(:eol) { str("\n") | any.absent? }
rule(:line) { (eol.absent? >> any).repeat >> eol }
rule(:lines) { line.as(:line).repeat }
I didn't follow my usual rules:
Always make repeat count explicit
Any rule that can match zero length strings, should have name ending in a '?'
So let's apply these...
rule(:eol?) { str("\n") | any.absent? }
# as the second option consumes nothing
rule(:line?) { (eol?.absent? >> any).repeat(0) >> eol? }
# repeat(0) can consume nothing
rule(:lines?) { line?.as(:line).repeat(0) }
# We have a problem! We have a rule that can consume nothing inside a `repeat`!
Here we see why we get an infinite loop. As the input is consumed, you end up with just the end of file, which matches eol? and hence line? (as the line body can be empty). Being inside the repeat in lines?, it keeps matching without consuming anything and loops forever.
We need to change the line rule so it always consumes something.
rule(:cr) { str("\n") }
rule(:eol?) { cr | any.absent? }
rule(:line_body) { (eol?.absent? >> any).repeat(1) }
rule(:line) { cr | line_body >> eol? }
rule(:lines?) { line.as(:line).repeat(0) }
Now line has to match something, either a cr (for empty lines), or at least one character followed by the optional eol?. All repeats have bodies that consume something. We are now golden.
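The "every pass through a repeat must consume something" rule is not specific to Parslet. As a rough plain-Ruby illustration using StringScanner (split_lines is a hypothetical helper, not anything from Parslet), the loop below is safe only because each iteration is guaranteed to advance:

```ruby
require 'strscan'

# Hypothetical helper: split input into line bodies, empty lines included.
def split_lines(input)
  scanner = StringScanner.new(input)
  lines = []
  until scanner.eos?                 # never attempt a match at end of input
    body = scanner.scan(/[^\n]*/)    # may be a zero-length match
    scanner.skip(/\n/)               # consume the line terminator, if any
    lines << body                    # empty body here means "\n" was consumed
  end
  lines
end

p split_lines("This\nis\n\na\nmultiline\nstring")
```

If the body is empty and we are not at the end of input, the next character must be a newline, so the skip always makes progress. Dropping the eos? check and looping on the scan alone would spin forever at the end of the input, which is the same trap as line? inside repeat.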
I think you have two, related, problems with your matching:
The pseudo-character match $ does not consume any real characters. You still need to consume the newlines somehow.
Parslet is munging the input in some way, making $ match in places you might not expect. The best result I could get using $ ended up matching each individual character.
Much safer to use \n as the end-of-line character. I got the following to work (I am somewhat of a beginner with Parslet myself, so apologies if it could be clearer):
require 'parslet'
class Lines < Parslet::Parser
rule(:text) { match("[^\n]") }
rule(:line) { ( text.repeat(0) >> match("\n") ) | text.repeat(1) }
rule(:lines) { line.as(:line).repeat }
root :lines
end
s = "This
is
a
multiline
string"
p Lines.new.parse( s )
The rule for the line is complex because of the need to match empty lines and a possible final line without a \n.
You don't have to use the .as(:line) syntax - I just added it to show clearly that the :line rule is matching each line individually, and not simply consuming the whole input.
I am currently writing a Ruby parser in Ruby, and more precisely with Parslet, since I think it is far easier to use than Treetop or Citrus. I create my rules using the official specifications, but there are some statements I cannot write, since they "exclude" some syntax, and I do not know how to do that... Well, here is an example to make this clearer...
Here is a basic rule :
foo::=
any-character+ BUT NOT (foo* escape_character barbar*)
# Knowing that (foo* escape_character barbar*) is included in any-character
How could I translate that using Parslet ? Maybe the absent?/present? stuff ?
Thank you very much, hope someone has an idea....
Have a nice day!
EDIT:
I tried what you said, so here's my translation into Ruby language using parslet:
rule(:line_comment){(source_character.repeat >> line_terminator >> source_character.repeat).absent? >> source_character.repeat(1)}
However, it does not seem to work (the sequence in parens). I did some tests, and came to the conclusion that what's written in my parens is wrong.
Here is a much simpler example. Let's consider these rules:
# Parslet rules
rule(:source_character) {any}
rule(:line_terminator){ str("\n") >> str("\r").maybe }
rule(:not){source_character.repeat >> line_terminator }
# Which looks like what I try to "detect" up there
I test these rules with this code:
# Code to test :
code = "test
"
But I get that:
Failed to match sequence (SOURCE_CHARACTER{0, } LINE_TERMINATOR) at
line 2 char 1. - Failed to match sequence (SOURCE_CHARACTER{0, }
LINE_TERMINATOR) at line 2 char 1.- Failed to match sequence (' '
' '?) at line 2 char 1.
`- Premature end of input at line 2 char 1. nil
If this sequence doesn't work, my 'complete' rule up there won't ever work... If anyone has an idea, it would be great.
Thank you !
You can do something like this:
rule(:word) { match['^")(\\s'].repeat(1) } # normal word
rule(:op) { str('AND') | str('OR') | str('NOT') }
rule(:keyword) { str('all:') | str('any:') }
rule(:searchterm) { keyword.absent? >> op.absent? >> word }
In this case, the absent? does a lookahead to make sure the next token is not a keyword; if not, then it checks to make sure it's not an operator; if not, finally see if it's a valid word.
An equivalent rule would be:
rule(:searchterm) { (keyword | op).absent? >> word }
Parslet matching is greedy by nature. This means that when you repeat something like
foo.repeat
parslet will match foo until it fails. If foo is
rule(:foo) { any }
you will be on the path to fail, since any.repeat always matches the entire rest of the document!
What you're looking for is something like the string matcher in examples/string_parser.rb (parslet source tree):
rule :string do
str('"') >>
(
(str('\\') >> any) |
(str('"').absent? >> any)
).repeat.as(:string) >>
str('"')
end
What this says is: 'match ", then match either a backslash followed by any character at all, or match any other character, as long as it is not the terminating ".'
So .absent? is really a way to exclude things from a match that follows:
str('foo').absent? >> (str('foo') | str('bar'))
will only match 'bar'. If you understand that, I assume you will be able to resolve your difficulties. Although those will not be the last on your way to a Ruby parser...
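If regular expressions are more familiar, absent? behaves much like a negative lookahead. As a plain Ruby Regexp analogue of the rule above (NOT_FOO is a name invented for this sketch, and this is not how Parslet implements it):

```ruby
# (?!foo) plays the role of str('foo').absent?: it checks what comes next
# without consuming any input, then the alternation does the real matching.
NOT_FOO = /\A(?!foo)(?:foo|bar)/

p NOT_FOO.match?('bar')   # the lookahead passes, then 'bar' matches
p NOT_FOO.match?('foo')   # the lookahead rejects the input before 'foo' is tried
```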