Rascal: resolve ambiguity with comment - comments

Consider the following grammar:
module Tst
lexical Id = [a-z][a-z0-9]* !>> [a-z0-9];
layout Layout = WhitespaceAndComment* !>> [\ \t\n\r];
lexical WhitespaceAndComment
= [\ \t\n\r]
| #category="Comment" ^ "*" ![\n]* $
;
start syntax TstStart = Id*;
then
start[TstStart] t = parse(#start[TstStart], "*bla\nABC");
gives an ambiguity, probably because the comment can be placed before or after the empty list of strings.
So, I have 2 questions:
How can I use diagnose() to get a diagnosis? I have tried diagnose(t) and diagnose(parse(#start[TstStart], "*bla\nABC")), without success.
What is the ambiguity and how can I resolve it?

Sorry, it has been a while. The comment definition contained a flaw; it has to be corrected as follows:
#category="Comment" ^ "*" ![\n]* $
This resolves the ambiguity, but I would still like to know how to use diagnose().

The ambiguity is caused by ![*]* which can eat a \n or not, because the whitespace notion can also eat the \n or not.
The diagnose function is notoriously bad at spotting issues with whitespace. That can be improved.
The solution is to use ![*\n]* and end the comment line with "\n".
Another ambiguity will arise with actual * comments in the code: between String* and Id*, the comment could go before or after the empty lists. To fix this, add a follow restriction on [*] to the layout: layout Layout = WhitespaceAndComment* !>> [\ \t\n\r] !>> [*];

When I make the example slightly more complicated I get another ambiguity.
module Tst
lexical Id = [a-z][a-z0-9]* !>> [a-z0-9];
lexical String = "\"" ![\"]* "\"";
layout Layout = WhitespaceAndComment* !>> [\ \t\n\r];
lexical WhitespaceAndComment
= [\ \t\n\r]
| #category="Comment" ^ "*" ![\n]* $
;
start syntax TstStart = String* Id*;
Then "*bla\nABC" gives an ambiguity again. Probably because the comment can be placed before, within and after the empty list of strings. How to resolve it?


Ruby if ... any? ... include? syntax

I need to check if any elements of a large (60,000+ elements) array are present in a long string of text. My current code looks like this:
if $TARGET_PARTLIST.any? { |target_pn| pdf_content_string.include? target_pn }
self.last_match_code = target_pn
self.is_a_match = true
end
I get a syntax error undefined local variable or method target_pn.
Could someone let me know the correct syntax to use for this block of code? Also, if anyone knows of a quicker way to do this, I'm all ears!
In this case, all your syntax is correct, you've just got a logic error. While target_pn is defined (as a parameter) inside the block passed to any?, it is not defined in the block of the if statement because the scope of the any?-block ends with the closing curly brace, and target_pn is not available outside its scope. A correct (and more idiomatic) version of your code would look like this:
self.is_a_match = $TARGET_PARTLIST.any? do |target_pn|
included = pdf_content_string.include? target_pn
self.last_match_code = target_pn if included
included
end
Alternately, as jvillian so kindly suggests, one could turn the string into an array of words, then do an intersection and see if the resulting set is nonempty. Like this:
self.is_a_match = !($TARGET_PARTLIST &
pdf_content_string.gsub(/[^A-Za-z ]/,"")
.split).empty?
Unfortunately, this approach loses self.last_match_code. As a note, pointed out by Sergio, if you're dealing with non-English languages, the above regex will have to be changed.
Hope that helps!
You should use Enumerable#find rather than Enumerable#any?.
found = $TARGET_PARTLIST.find { |target_pn| pdf_content_string.include? target_pn }
if found
self.last_match_code = found
self.is_a_match = true
end
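To make the difference concrete, here is a minimal sketch with made-up data (part_list and text are hypothetical stand-ins for $TARGET_PARTLIST and pdf_content_string). any? only answers yes or no, while find hands back the first matching element, which is what last_match_code needs.

```ruby
# Hypothetical sample data standing in for $TARGET_PARTLIST and pdf_content_string.
part_list = ["PN-100", "PN-200", "PN-300"]
text = "Order contains PN-200 and PN-300."

# any? collapses the search to a boolean; the block variable is gone afterwards.
any_result = part_list.any? { |pn| text.include?(pn) }

# find returns the first element for which the block is truthy, or nil.
find_result = part_list.find { |pn| text.include?(pn) }

puts any_result.inspect
puts find_result.inspect
```

Both calls stop at the first hit, so find costs no more than any? here.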
Note this does not ensure that the string contains a word that is an element of $TARGET_PARTLIST. For example, if $TARGET_PARTLIST contains the word "able", that string will be found in the string, "Are you comfortable?". If you only want to match words, you could do the following.
found = $TARGET_PARTLIST.find { |target_pn| pdf_content_string[/\b#{target_pn}\b/] }
Note this uses the method String#[].
\b is a word break in the regular expression, meaning that the first (last) character of the match cannot be preceded (followed) by a word character (a letter, digit or underscore).
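The "able" versus "comfortable" case can be checked directly. This is a small sketch with hypothetical sample strings; the call to Regexp.escape is my addition, on the assumption that a part number may contain characters such as . or + that would otherwise be read as regexp syntax.

```ruby
text = "Are you comfortable?"
word = "able"

# Plain substring test: succeeds because "able" sits inside "comfortable".
substring_hit = text.include?(word)

# Whole-word test: \b requires a non-word character (or string edge) on both sides.
whole_word_hit = text.match?(/\b#{Regexp.escape(word)}\b/)

puts substring_hit
puts whole_word_hit
puts "I am able.".match?(/\b#{Regexp.escape(word)}\b/)
```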
If speed is important it may be faster to use the following.
found = $TARGET_PARTLIST.find { |target_pn|
pdf_content_string.include?(target_pn) && pdf_content_string[/\b#{target_pn}\b/] }
A probably more performant way would be to move all this into native code by letting Regexp search for it.
# needed only once
TARGET_PARTLIST_RE = Regexp.new("\\b(?:#{$TARGET_PARTLIST.sort.map { |pl| Regexp.escape(pl) }.join('|')})\\b")
# to check
self.last_match_code = pdf_content_string[TARGET_PARTLIST_RE]
self.is_a_match = !self.last_match_code.nil?
A much more performant way would be to build a prefix tree and create the regexp using the prefix tree (this optimises the regexp lookup), but this is a bit more work :)
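One way the prefix-tree idea could look is sketched below. This is an illustration under my own assumptions, not a tuned implementation: build_trie, trie_to_source and trie_regexp are hypothetical helper names, the part numbers are made up, and whether it actually beats the flat alternation would need to be measured on the real list.

```ruby
# Build a nested-hash prefix tree; the :end key marks the end of a complete word.
def build_trie(words)
  trie = {}
  words.each do |word|
    node = trie
    word.each_char { |ch| node = (node[ch] ||= {}) }
    node[:end] = true
  end
  trie
end

# Turn a trie node into regexp source matching any suffix stored below it.
def trie_to_source(node)
  branches = node.reject { |key, _| key == :end }.map do |ch, child|
    Regexp.escape(ch) + trie_to_source(child)
  end
  return "" if branches.empty?

  body = branches.size == 1 ? branches.first : "(?:#{branches.join('|')})"
  # If a word ends at this node, everything deeper is optional.
  node[:end] ? "(?:#{body})?" : body
end

# Wrap the generated source in word boundaries, as in the flat version.
def trie_regexp(words)
  Regexp.new("\\b(?:#{trie_to_source(build_trie(words))})\\b")
end

# Hypothetical part numbers for illustration.
parts_re = trie_regexp(["ab12", "ab34", "ab", "cd9"])
puts "ship ab34 today"[parts_re].inspect
puts "ship ab3x today"[parts_re].inspect
```

Shared prefixes such as "ab" are written once in the generated pattern, so the engine does not retry the same leading characters for every alternative.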

gsub a special character

Hi,
I am trying to use gsub to remove this character: ’. Be careful, it is not ' or `; I think it comes from Microsoft Word.
I really don't understand why I can't remove this character, because I can remove all the others.
When I use gsub like this:
pattern = /(\’|\"|\.|\*|\/|\-|\\|\)|\$|\+|\(|\^|\?|\!|\~|\`)/
restring = string.gsub(pattern){|match|" " }
I get the error below:
syntax error, unexpected $end, expecting keyword_end
pattern = /(\’|\"|\.|\*|\/|\-|\\|\)|\$|\+|\(|\^|\?|\!|\~|\`)/
^
When I ran your RegEx through Rubular's site, I got this;
I figured it was a UTF-8 issue and after some additional stack overflow, it seems pretty common in a rails app to add # encoding: utf-8 to the top of your file.
You might add the following to your regex:
/\u2018|\u2019|\u201A/
which are some curly single quotes: ["‘", "’", "‚"].
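As a small sketch of that suggestion: writing the quotes as \u escapes keeps the source file itself ASCII-only, which sidesteps the source-encoding question for this pattern. The constant name and the sample text are hypothetical.

```ruby
# Curly single quotes written as \u escapes, so no non-ASCII byte appears in the source.
CURLY_SINGLE_QUOTES = /[\u2018\u2019\u201A]/

# Hypothetical text as it might arrive from a Word document.
sample = "it\u2019s a \u2018test\u2019"

cleaned = sample.gsub(CURLY_SINGLE_QUOTES, "'")
puts cleaned
```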
In case you're interested, here is a simple method I've used before for cleaning up Word text (pieced together from a number of resources online):
def replace(text)
# Inside [...] a | is a literal pipe, not alternation, so the classes list only the characters to replace.
text.
gsub(/[\u2018\u2019\u201A]/, "\'").
gsub(/[\u201C\u201D\u201E]/, "\"").
gsub(/\u2026/, "...").
gsub(/[\u2013\u2014]/, "-").
gsub(/\u02C6/, "^").
gsub(/\u2039/, "<").
gsub(/\u203A/, ">").
gsub(/[\u02DC\u00A0]/, " ")
end

Why pipes are not deleted using "gsub" in Ruby?

I would like to delete from notes everything starting from the example_header. I tried to do:
example_header = <<-EXAMPLE
-----------------
---| Example |---
-----------------
EXAMPLE
notes = <<-HTML
Hello World
#{example_header}
Example Here
HTML
puts notes.gsub(Regexp.new(example_header + ".*", Regexp::MULTILINE), "")
but the output is:
Hello World
||
Why || isn't deleted?
The pipes in your regular expression are being interpreted as the alternation operator. Your regular expression will replace the following three strings:
"-----------------\n---"
" Example "
"---\n-----------------"
You can solve your problem by using Regexp.escape to escape the string when you use it in a regular expression (ideone):
puts notes.gsub(Regexp.new(Regexp.escape(example_header) + ".*",
Regexp::MULTILINE),
"")
You could also consider avoiding regular expressions and just using the ordinary string methods instead (ideone):
puts notes[0, notes.index(example_header)]
Pipes are part of regexp syntax (they mean "or"). You need to escape them with a backslash in order to have them count as actual characters to be matched.
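The same effect can be reproduced on a tiny input. In this sketch (text and needle are made-up values), the unescaped pattern removes "a" and "b" separately and leaves the pipes behind, just like the || in the question, while the escaped pattern removes the literal string.

```ruby
text = "a|b|c"
needle = "a|b"

# Unescaped: the pattern is a|b, meaning "a" OR "b".
unescaped = text.gsub(Regexp.new(needle), "")

# Escaped: the pattern is a\|b, meaning the literal three characters "a|b".
escaped = text.gsub(Regexp.new(Regexp.escape(needle)), "")

puts unescaped
puts escaped
```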

Parslet : exclusion clause

I am currently writing a Ruby parser in Ruby, more precisely with Parslet, since I find it far easier to use than Treetop or Citrus. I create my rules from the official specifications, but there are some statements I cannot write, since they "exclude" some syntax, and I do not know how to do that. Here is an example to make this clearer.
Here is a basic rule :
foo::=
any-character+ BUT NOT (foo* escape_character barbar*)
# Knowing that (foo* escape_character barbar*) is included in any-character
How could I translate that using Parslet ? Maybe the absent?/present? stuff ?
Thank you very much, hope someone has an idea....
Have a nice day!
EDIT:
I tried what you said, so here's my translation into Ruby language using parslet:
rule(:line_comment){(source_character.repeat >> line_terminator >> source_character.repeat).absent? >> source_character.repeat(1)}
However, it does not seem to work (the sequence in parens). I did some tests, and came to the conclusion that what's written in my parens is wrong.
Here is a much simpler example; consider these rules:
# Parslet rules
rule(:source_character) {any}
rule(:line_terminator){ str("\n") >> str("\r").maybe }
rule(:not){source_character.repeat >> line_terminator }
# Which looks like what I try to "detect" up there
I test these rules with this code:
# Code to test :
code = "test
"
But I get that:
Failed to match sequence (SOURCE_CHARACTER{0, } LINE_TERMINATOR) at
line 2 char 1. - Failed to match sequence (SOURCE_CHARACTER{0, }
LINE_TERMINATOR) at line 2 char 1.- Failed to match sequence (' '
' '?) at line 2 char 1.
`- Premature end of input at line 2 char 1. nil
If this sequence doesn't work, my 'complete' rule up there won't ever work... If anyone has an idea, it would be great.
Thank you !
You can do something like this:
rule(:word) { match['^")(\\s'].repeat(1) } # normal word
rule(:op) { str('AND') | str('OR') | str('NOT') }
rule(:keyword) { str('all:') | str('any:') }
rule(:searchterm) { keyword.absent? >> op.absent? >> word }
In this case, the absent? does a lookahead to make sure the next token is not a keyword; if not, then it checks to make sure it's not an operator; if not, finally see if it's a valid word.
An equivalent rule would be:
rule(:searchterm) { (keyword | op).absent? >> word }
Parslet matching is greedy by nature. This means that when you repeat something like
foo.repeat
parslet will match foo until it fails. If foo is
rule(:foo) { any }
you will be on the path to fail, since any.repeat always matches the entire rest of the document!
What you're looking for is something like the string matcher in examples/string_parser.rb (parslet source tree):
rule :string do
str('"') >>
(
(str('\\') >> any) |
(str('"').absent? >> any)
).repeat.as(:string) >>
str('"')
end
What this says is: 'match ", then match either a backslash followed by any character at all, or match any other character, as long as it is not the terminating ".'
So .absent? is really a way to exclude things from a match that follows:
str('foo').absent? >> (str('foo') | str('bar'))
will only match 'bar'. If you understand that, I assume you will be able to resolve your difficulties. Although those will not be the last on your way to a Ruby parser...
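For readers more used to regular expressions, the same idea can be tried with Ruby's built-in Regexp, where (?!...) plays the role of .absent?. This is only an analogy, not Parslet code; the Parslet expressions in the comments are my assumed equivalents and the variable names are hypothetical.

```ruby
# Analogue of: str('foo').absent? >> (str('foo') | str('bar'))
# The lookahead rejects "foo" before the alternation gets a chance to match it.
not_foo = /\A(?!foo)(?:foo|bar)/

# Assumed analogue of: (line_terminator.absent? >> any).repeat(1)
# With /m, "." also matches "\n", so only the lookahead stops the repetition.
until_newline = /\A(?:(?!\n).)+/m

puts "bar"[not_foo].inspect
puts "foo"[not_foo].inspect
puts "test\nrest"[until_newline].inspect
```

The second pattern shows the shape of the fix for the line-comment rule: put the exclusion inside the repetition, so it is checked before each character is consumed, instead of repeating any on its own.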

Ruby: Escaping special characters in a string

I am trying to write a method that is the same as mysqli_real_escape_string in PHP. It takes a string and escapes any 'dangerous' characters. I have looked for a method that will do this for me but I cannot find one. So I am trying to write one on my own.
This is what I have so far (I tested the pattern at Rubular.com and it worked):
# Finds the following characters and escapes them by preceding them with a backslash. Characters: ' " . * / \ -
def escape_characters_in_string(string)
pattern = %r{ (\'|\"|\.|\*|\/|\-|\\) }
string.gsub(pattern, '\\\0') # <-- Trying to take the currently found match and add a \ before it I have no idea how to do that).
end
And I am using start_string as the string I want to change, and correct_string as what I want start_string to turn into:
start_string = %("My" 'name' *is* -john- .doe. /ok?/ C:\\Drive)
correct_string = %(\"My\" \'name\' \*is\* \-john\- \.doe\. \/ok?\/ C:\\\\Drive)
Can somebody try and help me determine why I am not getting my desired output (correct_string) or tell me where I can find a method that does this, or even better tell me both? Thanks a lot!
Your pattern isn't defined correctly in your example. This is as close as I can get to your desired output.
Output
"\\\"My\\\" \\'name\\' \\*is\\* \\-john\\- \\.doe\\. \\/ok?\\/ C:\\\\Drive"
It's going to take some tweaking on your part to get it 100% but at least you can see your pattern in action now.
def self.escape_characters_in_string(string)
pattern = /(\'|\"|\.|\*|\/|\-|\\)/
string.gsub(pattern){|match|"\\" + match} # prepend a backslash to each match
end
I have changed above function like this:
def self.escape_characters_in_string(string)
pattern = /(\'|\"|\.|\*|\/|\-|\\|\)|\$|\+|\(|\^|\?|\!|\~|\`)/
string.gsub(pattern){|match|"\\" + match}
end
This is working great for regex
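For comparison, here is a sketch of the block form next to a replacement-string form. The constant and method names are hypothetical, and the character class covers only the seven characters from the original question. The replacement string needs a doubled backslash because gsub itself interprets backslashes in it.

```ruby
# The seven characters from the question: ' " . * / \ -
SPECIAL = /['".*\/\\-]/

# Block form: the match is passed in and the result is used verbatim.
def escape_with_block(string)
  string.gsub(SPECIAL) { |match| "\\" + match }
end

# String form: in the replacement, \\ is a literal backslash and \0 is the whole match.
def escape_with_string(string)
  string.gsub(SPECIAL, '\\\\\0')
end

sample = %q("My" -john- C:\Drive)
puts escape_with_block(sample)
puts escape_with_string(sample)
```

The block form is usually easier to read, since its return value is not scanned for backreferences.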
This should get you started:
print %("'*-.).gsub(/["'*.-]/){ |s| '\\' + s }
\"\'\*\-\.
Take a look at the ActiveRecord sanitization methods: http://api.rubyonrails.org/classes/ActiveRecord/Base.html#method-c-sanitize_sql_array
Take a look at escape_string / quote method in Mysql class here
