I am currently writting a Ruby parser using Ruby, and more precisely Parslet, since I think it is far more easier to use than Treetop or Citrus. I create my rules using the official specifications, but there are some statements I can not write, since they "exclude" some syntax, and I do not know how to do that... Well, here is an example for you to understand...
Here is a basic rule :
foo::=
any-character+ BUT NOT (foo* escape_character barbar*)
# Knowing that (foo* escape_character barbar*) is included in any-character
How could I translate that using Parslet ? Maybe the absent?/present? stuff ?
Thank you very much, hope someone has an idea....
Have a nice day!
EDIT:
I tried what you said, so here's my translation into Ruby language using parslet:
rule(:line_comment){(source_character.repeat >> line_terminator >> source_character.repeat).absent? >> source_character.repeat(1)}
However, it does not seem to work (the sequence in parens). I did some tests, and came to the conclusion that what's written in my parens is wrong.
Here is a very easier example, let's consider these rules:
# Parslet rules
rule(:source_character) {any}
rule(:line_terminator){ str("\n") >> str("\r").maybe }
rule(:not){source_character.repeat >> line_terminator }
# Which looks like what I try to "detect" up there
I these these rules with this code:
# Code to test :
code = "test
"
But I get that:
Failed to match sequence (SOURCE_CHARACTER{0, } LINE_TERMINATOR) at
line 2 char 1. - Failed to match sequence (SOURCE_CHARACTER{0, }
LINE_TERMINATOR) at line 2 char 1.- Failed to match sequence (' '
' '?) at line 2 char 1.
`- Premature end of input at line 2 char 1. nil
If this sequence doesn't work, my 'complete' rule up there won't ever work... If anyone has an idea, it would be great.
Thank you !
You can do something like this:
rule(:word) { match['^")(\\s'].repeat(1) } # normal word
rule(:op) { str('AND') | str('OR') | str('NOT') }
rule(:keyword) { str('all:') | str('any:') }
rule(:searchterm) { keyword.absent? >> op.absent? >> word }
In this case, the absent? does a lookahead to make sure the next token is not a keyword; if not, then it checks to make sure it's not an operator; if not, finally see if it's a valid word.
An equivalent rule would be:
rule(:searchterm) { (keyword | op).absent? >> word }
Parslet matching is greedy by nature. This means that when you repeat something like
foo.repeat
parslet will match foo until it fails. If foo is
rule(:foo) { any }
you will be on the path to fail, since any.repeat always matches the entire rest of the document!
What you're looking for is something like the string matcher in examples/string_parser.rb (parslet source tree):
rule :string do
str('"') >>
(
(str('\\') >> any) |
(str('"').absent? >> any)
).repeat.as(:string) >>
str('"')
end
What this says is: 'match ", then match either a backslash followed by any character at all, or match any other character, as long as it is not the terminating ".'
So .absent? is really a way to exclude things from a match that follows:
str('foo').absent? >> (str('foo') | str('bar'))
will only match 'bar'. If you understand that, I assume you will be able to resolve your difficulties. Although those will not be the last on your way to a Ruby parser...
Related
I'm just starting with ruby and parslet, so this might be obvious to others (hopefully).
I'm wanting to get all the words up until a delimiter (^) without consuming it
The following rule works (but consumes the delimeter) with a result of {:wrd=>"otherthings"#0, :delim=>"^"#11}
require 'parslet'
class Mini < Parslet::Parser
rule(:word) { match('[a-zA-Z]').repeat}
rule(:delimeter) { str('^') }
rule(:othercontent) { word.as(:wrd) >> delimeter.as(:delim) }
root(:othercontent)
end
puts Mini.new.parse("otherthings^")
I was trying to use the 'present?',
require 'parslet'
class Mini < Parslet::Parser
rule(:word) { match('[a-zA-Z]').repeat}
rule(:delimeter) { str('^') }
rule(:othercontent) { word.as(:wrd) >> delimeter.present? }
root(:othercontent)
end
puts Mini.new.parse("otherthings^")
but this throws an exception:
Failed to match sequence (wrd:WORD &DELIMETER) at line 1 char 12. (Parslet::ParseFailed)
At a later stage I'll want to inspect the word to the right of the delimeter to build up a more complex grammar which is why I don't want to consume the delimeter.
I'm using parslet 1.5.0.
Thanks for your help!
TL;DR;
If you care what is before the "^" you should parse that first.
--- longer answer ---
A parser will always consume all the text. If it can't consume everything, then the document is not fully described by the grammar. Rather than thinking of it as something performing "splits" on your text... instead think of it as a clever state machine consuming a stream of text.
So... as your full grammar needs to consume all the document... when developing your parser, you can't make it to parse some part and leave the rest. You want it to transform your document into a tree so you can manipulate it into it's final from.
If you really wanted to just consume all text before a delimiter, then you could do something like this...
Say I was going to parse a '^' separated list of things.
I could have the following rules
rule(:thing) { (str("^").absent? >> any).repeat(1) } # anything that's not a ^
rule(:list) { thing >> ( str("^") >> thing).repeat(0) } #^ separated list of things
This would work as follows
parse("thing1^thing2") #=> "thing1^thing2"
parse("thing1") #=> "thing1"
parse("thing1^") #=> ERROR ... nothing after the ^ there should be a 'thing'
This would mean list would match a string that doesn't end or start with an '^'. To be useful however I need to pull out the bits that are the values with the "as" keyword
rule(:thing) { (str("^").absent? >> any).repeat(1).as(:thing) }
rule(:list) { thing >> ( str("^") >> thing).repeat(0) }
Now when list matches a string I get an array of hashes of "things".
parse("thing1^thing2") #=> [ {:thing=>"thing1"#0} , {:thing=>"thing2"#7} ]
In reality however you probably care what a 'thing' is... not just anything will go there.
In that case.. you should start by defining those rules... because you don't want to use the parser to split by "^" then re-parse the strings to work out what they are made of.
For example:
parse("6 + 4 ^ 2")
# => [ {:thing=>"6 + 4 "#0}, {:thing=>" 2"#7} ]
And I probably want to ignore the white_space around the "thing"s and I probably want to deal with the 6 the + and the 4 all separately. When I do that I am going to have to throw away my "all things that aren't '^'" rule.
Is there a way to backreference a previous string in parslet similarly to the \1 functionality in typical regular expressions ?
I want to extract the characters within a block such as:
Marker SomeName
some random text, numbers123
and symbols !#%
SomeName
in which "Marker" is a known string but "SomeName" is not known a-priori, so I believe I need something like:
rule(:name) { ( match('\w') >> match('\w\d') ).repeat(1) }
rule(:text_within_the_block) {
str('Marker') >> name >> any.repeat.as(:text_block) >> backreference_to_name
}
What I don't know is how to write the backreference_to_name rule using Parslet and/or Ruby language.
From http://kschiess.github.io/parslet/parser.html
Capturing input
Sometimes a parser needs to match against something that was already
matched against. Think about Ruby heredocs for example:
str = <-HERE
This is part of the heredoc.
HERE
The key to matching this kind of document is to capture part of the
input first and then construct the rest of the parser based on the
captured part. This is what it looks like in its simplest form:
match['ab'].capture(:capt) >> # create the capture
dynamic { |s,c| str(c.captures[:capt]) } # and match using the capture
The key here is that the dynamic block returns a lazy parser. It's only evaluated at the point it's being used and gets passed it's current context to reference at the point of execution.
-- Updated : To add a worked example --
So for your example:
require 'parslet'
require 'parslet/convenience'
class Mini < Parslet::Parser
rule(:name) { match("[a-zA-Z]") >> match('\\w').repeat }
rule(:text_within_the_block) {
str('Marker ') >>
name.capture(:namez).as(:name) >>
str(" ") >>
dynamic { |_,scope|
(str(scope.captures[:namez]).absent? >> any).repeat
}.as(:text_block) >>
dynamic { |src,scope| str(scope.captures[:namez]) }
}
root (:text_within_the_block)
end
puts Mini.new.parse_with_debug("Marker BOB some text BOB") .inspect
#=> {:name=>"BOB"#7, :text_block=>"some text "#11}
This required a couple of changes.
I changed rule(:name) to match a single word and added a str(" ") to detect that word had ended. (Note: \w is short for [A-Za-z0-9_] so it includes digits)
I changed the "any" match to be conditional on the text not being the :name text. (otherwise it consumes the 'BOB' and then fails to match, ie. it's greedy!)
I don't exactly want to support stackoverflow, but as you seem to be a parslet user, here goes: Try asking on the mailing list for a real nice answer. (http://dir.gmane.org/gmane.comp.lang.ruby.parslet)
What you call back-reference here is called a 'capture' in parslet. Please see the example 'capture.rb' in parslets source tree.
I'm looking for a way to match multiple lines Parslet.
The code looks like this:
rule(:line) { (match('$').absent? >> any).repeat >> match('$') }
rule(:lines) { line.repeat }
However, lines will always end up in an infinite loop which is because match('$') will endlessly repeat to match end of string.
Is it possible to match multiple lines that can be empty?
irb(main)> lines.parse($stdin.read)
This
is
a
multiline
string^D
should match successfully. Am I missing something? I also tried (match('$').absent? >> any.maybe).repeat(1) >> match('$') but that doesn't match empty lines.
Regards,
Danyel.
I usually define a rule for end_of_line. This is based on the trick in http://kschiess.github.io/parslet/tricks.html for matching end_of_file.
class MyParser < Parslet::Parser
rule(:cr) { str("\n") }
rule(:eol?) { any.absent? | cr }
rule(:line_body) { (eol?.absent? >> any).repeat(1) }
rule(:line) { cr | line_body >> eol? }
rule(:lines?) { line.repeat (0)}
root(:lines?)
end
puts MyParser.new.parse(""" this is a line
so is this
that was too
This ends""").inspect
Obviously if you want to do more with the parser than you can achieve with String::split("\n") you will replace the line_body with something useful :)
I had a quick go at answering this question and mucked it up. I just though I would explain the mistake I made, and show you how to avoid mistakes of that kind.
Here is my first answer.
rule(:eol) { str('\n') | any.absent? }
rule(:line) { (eol.absent? >> any).repeat >> eol }
rule(:lines) { line.as(:line).repeat }
I didn't follow my usual rules:
Always make repeat count explicit
Any rule that can match zero length strings, should have name ending in a '?'
So lets apply these...
rule(:eol?) { str('\n') | any.absent? }
# as the second option consumes nothing
rule(:line?) { (eol.absent? >> any).repeat(0) >> eol? }
# repeat(0) can consume nothing
rule(:lines?) { line.as(:line?).repeat(0) }
# We have a problem! We have a rule that can consume nothing inside a `repeat`!
Here see why we get an infinite loop. As the input is consumed, you end up with just the end of file, which matches eol? and hence line? (as the line body can be empty). Being inside lines' repeat, it keeps matching without consuming anything and loops forever.
We need to change the line rule so it always consumes something.
rule(:cr) { str('\n') }
rule(:eol?) { cr | any.absent? }
rule(:line_body) { (eol.absent? >> any).repeat(1) }
rule(:line) { cr | line_body >> eol? }
rule(:lines?) { line.as(:line).repeat(0) }
Now line has to match something, either a cr (for empty lines), or at least one character followed by the optional eol?. All repeats have bodies that consume something. We are now golden.
I think you have two, related, problems with your matching:
The pseudo-character match $ does not consume any real characters. You still need to consume the newlines somehow.
Parslet is munging the input in some way, making $ match in places you might not expect. The best result I could get using $ ended up matching each individual character.
Much safer to use \n as the end-of-line character. I got the following to work (I am somewhat of a beginner with Parslet myself, so apologies if it could be clearer):
require 'parslet'
class Lines < Parslet::Parser
rule(:text) { match("[^\n]") }
rule(:line) { ( text.repeat(0) >> match("\n") ) | text.repeat(1) }
rule(:lines) { line.as(:line).repeat }
root :lines
end
s = "This
is
a
multiline
string"
p Lines.new.parse( s )
The rule for the line is complex because of the need to match empty lines and a possible final line without a \n.
You don't have to use the .as(:line) syntax - I just added it to show clearly that the :line rule is matching each line individually, and not simply consuming the whole input.
I am trying to write a method that is the same as mysqli_real_escape_string in PHP. It takes a string and escapes any 'dangerous' characters. I have looked for a method that will do this for me but I cannot find one. So I am trying to write one on my own.
This is what I have so far (I tested the pattern at Rubular.com and it worked):
# Finds the following characters and escapes them by preceding them with a backslash. Characters: ' " . * / \ -
def escape_characters_in_string(string)
pattern = %r{ (\'|\"|\.|\*|\/|\-|\\) }
string.gsub(pattern, '\\\0') # <-- Trying to take the currently found match and add a \ before it I have no idea how to do that).
end
And I am using start_string as the string I want to change, and correct_string as what I want start_string to turn into:
start_string = %("My" 'name' *is* -john- .doe. /ok?/ C:\\Drive)
correct_string = %(\"My\" \'name\' \*is\* \-john\- \.doe\. \/ok?\/ C:\\\\Drive)
Can somebody try and help me determine why I am not getting my desired output (correct_string) or tell me where I can find a method that does this, or even better tell me both? Thanks a lot!
Your pattern isn't defined correctly in your example. This is as close as I can get to your desired output.
Output
"\\\"My\\\" \\'name\\' \\*is\\* \\-john\\- \\.doe\\. \\/ok?\\/ C:\\\\Drive"
It's going to take some tweaking on your part to get it 100% but at least you can see your pattern in action now.
def self.escape_characters_in_string(string)
pattern = /(\'|\"|\.|\*|\/|\-|\\)/
string.gsub(pattern){|match|"\\" + match} # <-- Trying to take the currently found match and add a \ before it I have no idea how to do that).
end
I have changed above function like this:
def self.escape_characters_in_string(string)
pattern = /(\'|\"|\.|\*|\/|\-|\\|\)|\$|\+|\(|\^|\?|\!|\~|\`)/
string.gsub(pattern){|match|"\\" + match}
end
This is working great for regex
This should get you started:
print %("'*-.).gsub(/["'*.-]/){ |s| '\\' + s }
\"\'\*\-\.
Take a look at the ActiveRecord sanitization methods: http://api.rubyonrails.org/classes/ActiveRecord/Base.html#method-c-sanitize_sql_array
Take a look at escape_string / quote method in Mysql class here
Using Ruby I'm trying to split the following text with a Regex
~foo\~\=bar =cheese~monkey
Where ~ or = denotes the beginning of match unless it is escaped with \
So it should match
~foo\~\=bar
then
=cheese
then
~monkey
I thought the following would work, but it doesn't.
([~=]([^~=]|\\=|\\~)+)(.*)
What is a better regex expression to use?
edit To be more specific, the above regex matches all occurrences of = and ~
edit Working solution. Here is what I came up with to solve the issue. I found that Ruby 1.8 has look ahead, but doesn't have lookbehind functionality. So after looking around a bit, I came across this post in comp.lang.ruby and completed it with the following:
# Iterates through the answer clauses
def split_apart clauses
reg = Regexp.new('.*?(?:[~=])(?!\\\\)', Regexp::MULTILINE)
# need to use reverse since Ruby 1.8 has look ahead, but not look behind
matches = clauses.reverse.scan(reg).reverse.map {|clause| clause.strip.reverse}
matches.each do |match|
yield match
end
end
What does "remove the head" mean in this context?
If you want to remove everything before a certain char, this will do:
.*?(?<!\\)= // anything up to the first "=" that is not preceded by "\"
.*?(?<!\\)~ // same, but for the squiggly "~"
.*?(?<!\\)(?=~) // same, but excluding the separator itself (if you need that)
Replace by "", repeat, done.
If your string has exactly three elements ("1=2~3") and you want to match all of them at once, you can use:
^(.*?(?<!\\)(?:=))(.*?(?<!\\)(?:~))(.*)$
matches: \~foo\~\=bar =cheese~monkey
| 1 | 2 | 3 |
Alternatively, you split the string using this regex:
(?<!\\)[=~]
returns: ['\~foo\~\=bar ', 'cheese', 'monkey'] for "\~foo\~\=bar =cheese~monkey"
returns: ['', 'foo\~\=bar ', 'cheese', 'monkey'] for "~foo\~\=bar =cheese~monkey"