How to define {min,max} matches in treetop peg - ruby

With Ruby's regular expressions I could write /[0-9]{3,}/, but I can't figure out how to write this in Treetop other than:
rule at_least_three_digit_number
[0-9] [0-9] [0-9]+
end
Is there a 'match [at least|most] n' rule for treetop?

It looks like PEGs don't have some of the RE convenience operators, but in return you do get a much more powerful expression matcher.

http://treetop.rubyforge.org/syntactic_recognition.html
A generalised repetition count (minimum, maximum) is also available.
'foo' 2.. matches 'foo' two or more times
'foo' 3..5 matches 'foo' from three to five times
'foo' ..4 matches 'foo' from zero to four times
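For readers coming from regular expressions, the three counts above line up with Ruby's brace quantifiers. The sketch below is plain Ruby rather than Treetop. The constant names are made up for illustration, and the Treetop rule shown in the comment is untested and assumes a Treetop version that supports the repetition-count syntax quoted above.

```ruby
# Treetop form          Ruby regex counterpart
#   'foo' 2..      ~    (?:foo){2,}
#   'foo' 3..5     ~    (?:foo){3,5}
#   'foo' ..4      ~    (?:foo){0,4}
TWO_OR_MORE   = /\A(?:foo){2,}\z/
THREE_TO_FIVE = /\A(?:foo){3,5}\z/
UP_TO_FOUR    = /\A(?:foo){0,4}\z/

# By the same reading, the rule from the question would become (untested):
#   rule at_least_three_digit_number
#     [0-9] 3..
#   end
AT_LEAST_THREE_DIGITS = /\A[0-9]{3,}\z/

puts TWO_OR_MORE.match?('foofoo')          # true
puts TWO_OR_MORE.match?('foo')             # false
puts THREE_TO_FIVE.match?('foo' * 6)       # false
puts UP_TO_FOUR.match?('')                 # true
puts AT_LEAST_THREE_DIGITS.match?('12')    # false
puts AT_LEAST_THREE_DIGITS.match?('1234')  # true
```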

Related

How do I specify in Ruby that I want to match a character provided that a sequence following that character does not match a pattern?

I'm using Ruby on Rails 5.1. In Ruby, how do I say that I want to match a string if the first character matches something but the sequence that follows does NOT match a pattern? That is, I want to match a number provided that the sequence that follows is not a character from an array I have followed by two other numbers. Here's my character array ...
2.4.0 :010 > TOKENS
=> [":", ".", "'"]
So this string would NOT match
3:00
since ":00" matches the pattern of a character from my array followed by two numbers. But this string
3400
would match. This string would also match
3:0
and this would match
3
since nothing follows the above. How do I write the appropriate regex in Ruby?
string =~ /\A\d+(?!:\d{2})/
This regular expression means:
\A anchors the match to the start of the string.
\d+ means "one or more digits".
(?!...) is a negative look-ahead. It checks that the pattern contained in the brackets does not match, looking ahead from the current position.
:\d{2} means : followed by two digits.
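The same idea can be extended to every character in the TOKENS array from the question. The following is only a sketch: PATTERN and separator are illustrative names, and it deliberately uses a single \d for the first character, because with \d+ the engine can backtrack to a shorter run of digits (for example, "34:00" would match on just "3").

```ruby
TOKENS = [":", ".", "'"].freeze

# Regexp.union escapes each string, so "." is treated literally.
separator = Regexp.union(TOKENS)

# One digit at the start, not followed by a token plus two digits.
PATTERN = /\A\d(?!(?:#{separator.source})\d{2})/

%w[3:00 3400 3:0 3 3.15].each do |s|
  puts "#{s.inspect} #{PATTERN.match?(s) ? 'matches' : 'does not match'}"
end
# "3:00" does not match
# "3400" matches
# "3:0" matches
# "3" matches
# "3.15" does not match
```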
Consideration should be given to testing the first character and the remaining characters separately.
def match_it?(str, first_char_regex, no_match_regex)
str[0].match?(first_char_regex) && !str[1..-1].match?(no_match_regex)
end
match_it?("0:00", /0/, /\A[:. ]cat\z/) #=> true
match_it?("0:00", /\d/, /\A[:. ]\d+\z/) #=> false
match_it?("0:00", /[[:alpha:]]/, /\A[:. ]\d+\z/) #=> false
I believe this reads well and it simplifies testing when compared to methods that employ a single regular expression.

How do I write a regular expression that will match characters in any order?

I'm trying to write a regular expressions that will match a set of characters without regard to order. For example:
str = "act"
str.scan(/Insert expression here/)
would match:
cat
act
tca
atc
tac
cta
but would not match ca, ac or cata.
I read through a lot of similar questions and answers here on StackOverflow, but have not found one that matches my objectives exactly.
To clarify a bit, I'm using ruby and do not want to allow repeat characters.
Here is your solution
^(?:([act])(?!.*\1)){3}$
See it here on Regexr
^ # matches the start of the string
(?: # open a non capturing group
([act]) # The characters that are allowed and a capturing group
(?!.*\1) # That character is matched only if it does not occur once more, Lookahead assertion
){3} # Defines the amount of characters
$
The only special thing is the lookahead assertion, which ensures the character is not repeated.
^ and $ are anchors to match the start and the end of the string.
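A quick way to check this pattern in Ruby is sketched below. Note that in Ruby ^ and $ match at line boundaries, so this version swaps in \A and \z to anchor on the whole string; the constant name ANY_ORDER is just for illustration.

```ruby
# Three characters from [act], each one not appearing again later.
ANY_ORDER = /\A(?:([act])(?!.*\1)){3}\z/

%w[cat act tca atc tac cta ca ac cata aca].each do |word|
  puts "#{word}: #{ANY_ORDER.match?(word)}"
end
# cat: true
# act: true
# tca: true
# atc: true
# tac: true
# cta: true
# ca: false
# ac: false
# cata: false
# aca: false
```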
[act]{3} or ^[act]{3}$ will do it in most regular expression dialects. If you can narrow down the system you're using, that will help you get a more specific answer.
Edit: as mentioned by @georgydyer in the comments below, it's unclear from your question whether or not repeated characters are allowed. If not, you can adapt the answer from this question and get:
^(?=[act]{3}$)(?!.*(.).*\1).*$
That is, a positive lookahead to check a match, and then a negative lookahead with a backreference to exclude repeated characters.
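To reuse this for other character sets, the pattern can be generated from the input string. This is a rough sketch, not part of the original answer: any_order_regex is a hypothetical helper, it assumes the characters in the input are distinct, and it uses \A and \z instead of ^ and $ so that the whole string is checked.

```ruby
# Hypothetical helper: builds the two-lookahead pattern for any set of
# distinct characters.
def any_order_regex(chars)
  set = Regexp.escape(chars)
  # Positive lookahead: exactly chars.length characters from the set.
  # Negative lookahead: no character occurs twice.
  /\A(?=[#{set}]{#{chars.length}}\z)(?!.*(.).*\1).*\z/
end

regex = any_order_regex('act')
puts regex.match?('tca')   # true
puts regex.match?('cca')   # false (repeated character)
puts regex.match?('cata')  # false (too long)
puts regex.match?('ca')    # false (too short)
```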
Here's how I'd go about it:
regex = /\b(?:#{ Regexp.union(str.split('').permutation.map{ |a| a.join }).source })\b/
# => /\b(?:act|atc|cat|cta|tac|tca)\b/
%w[
cat act tca atc tac cta
ca ac cata
].each do |w|
puts '"%s" %s' % [w, w[regex] ? 'matches' : "doesn't match"]
end
That outputs:
"cat" matches
"act" matches
"tca" matches
"atc" matches
"tac" matches
"cta" matches
"ca" doesn't match
"ac" doesn't match
"cata" doesn't match
I use the technique of passing an array into Regexp.union for a lot of things; it works especially well with the keys of a hash, and passing the hash into gsub for rapid search/replace on text templates. This is the example from the gsub documentation:
'hello'.gsub(/[eo]/, 'e' => 3, 'o' => '*') #=> "h3ll*"
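Combining the two ideas on a text template might look like the sketch below. The placeholder syntax, the replacements hash, and the template string are all invented for illustration.

```ruby
# Hypothetical template placeholders mapped to their replacement text.
replacements = { '{{name}}' => 'Alice', '{{city}}' => 'Paris' }

# Regexp.union escapes the string keys, so the braces are matched literally.
pattern = Regexp.union(replacements.keys)

template = 'Hello {{name}}, welcome to {{city}}.'

# With a hash as the second argument, gsub looks up each match as a key.
result = template.gsub(pattern, replacements)
puts result  # Hello Alice, welcome to Paris.
```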
Regexp.union creates a regex, and it's important to use source instead of to_s when extracting the actual pattern being generated:
puts regex.to_s
=> (?-mix:\b(?:act|atc|cat|cta|tac|tca)\b)
puts regex.source
=> \b(?:act|atc|cat|cta|tac|tca)\b
Notice how to_s embeds the pattern's flags inside the string. If you don't expect them you can accidentally embed that pattern into another, which won't behave as you expect. Been there, done that and have the dented helmet as proof.
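One way to see the difference is to embed a pattern into a case-insensitive regex. In this sketch (variable names are illustrative), the flags carried along by to_s switch case-insensitivity back off for the embedded part, while source leaves the outer flags in charge.

```ruby
inner = /cat/

puts inner.to_s     # (?-mix:cat)
puts inner.source   # cat

# Build an outer, case-insensitive regex from each string form.
with_to_s   = Regexp.new(inner.to_s,   Regexp::IGNORECASE)
with_source = Regexp.new(inner.source, Regexp::IGNORECASE)

puts with_to_s.match?('CAT')    # false: the embedded (?-mix:...) disables /i
puts with_source.match?('CAT')  # true
```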
If you really want to have fun, look into the Perl Regexp::Assemble module available on CPAN. Using that, plus List::Permutor, lets us generate more complex patterns. On a simple string like this it won't save much space, but on long strings or large arrays of desired hits it can make a huge difference. Unfortunately, Ruby has nothing like this, but it is possible to write a simple Perl script with the word or array of words, and have it generate the regex and pass it back:
use List::Permutor;
use Regexp::Assemble;
my $regex_assembler = Regexp::Assemble->new;
my $perm = new List::Permutor split('', 'act');
while (my @set = $perm->next) {
$regex_assembler->add(join('', @set));
}
print $regex_assembler->re, "\n";
(?-xism:(?:a(?:ct|tc)|c(?:at|ta)|t(?:ac|ca)))
See "Is there an efficient way to perform hundreds of text substitutions in Ruby?" for more information about using Regexp::Assemble with Ruby.
I will assume several things here:
- You are looking for permutations of given characters
- You are using ruby
str = "act"
permutations = str.split(//).permutation.map{|p| p.join("")}
# and for the actual test
permutations.include?("cat")
It is no regex though.
No doubt - the regex that uses positive/negative lookaheads and backreferences is slick, but if you're only dealing with three characters, I'd err on the side of verbosity by explicitly enumerating the character permutations like @scones suggested.
"act".split('').permutation.map(&:join)
=> ["act", "atc", "cat", "cta", "tac", "tca"]
And if you really need a regex out of it for scanning a larger string, you can always:
/\b(#{Regexp.union("act".split('').permutation.map(&:join)).source})\b/
=> /\b(act|atc|cat|cta|tac|tca)\b/
Obviously, this strategy doesn't scale if your search string grows, but it's much easier to observe the intent of code like this in my opinion.
EDIT: Added word boundaries for false positive on cata based on @theTinMan's feedback.

How can I avoid left-recursion in treetop without backtracking?

I am having trouble avoiding left-recursion in this simple expression parser I'm working on. Essentially, I want to parse the equation 'f x y' into two expressions 'f x' and '(f x) y' (with implicit parentheses). How can I do this while avoiding left-recursion and backtracking? Does there have to be an intermediate step?
#!/usr/bin/env ruby
require 'rubygems'
require 'treetop'
Treetop.load_from_string DATA.read
parser = ExpressionParser.new
p parser.parse('f x y').value
__END__
grammar Expression
rule equation
expression (w+ expression)*
end
rule expression
expression w+ atom
end
rule atom
var / '(' w* expression w* ')'
end
rule var
[a-z]
end
rule w
[\s\n\t\r]
end
end
You haven't given enough information about your desired result. In particular, do you expect "f(a b) y" to parse as "(f(a(b))) y"? I assume you do... which means that a function not followed by an open parenthesis has arity one.
So you want to say:
rule equation
expression w* var / expression w* parenthesised_list
end
rule parenthesised_list
'(' w* ( expression w* )+ ')'
end
If on the other hand you have external (to the grammar) knowledge of the arity of f, and you want to iterate "expression" exactly that many times - as happens in parsing TeX for example - then you will need to use a semantic predicate (&{|s| ...}) inside the iterated expression list. Beware that the argument passed to the block of a semantic predicate is not a SyntaxNode (which cannot yet be constructed because this sequence sub-rule has not yet succeeded) but the accumulated array of nodes so far in the sequence. The truthiness of the block return value dictates the parse result and can stop the iteration.
Another tool you might consider using is lookahead (!stuff_I_dont_expect_to_follow or &stuff_that_must_follow).
You can also ask such questions in http://groups.google.com/group/treetop-dev
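As a side note that goes beyond the answer above: a common way to get left-associative application without a left-recursive rule is to parse a flat list of atoms and then fold it from the left as an intermediate step. The sketch below shows only that folding step in plain Ruby; left_assoc is a hypothetical helper, and it assumes atoms are separated by whitespace and contain no parentheses.

```ruby
# Hypothetical helper: turns a flat "f x y" into nested applications.
def left_assoc(source)
  atoms = source.split
  # reduce without an initial value starts from the first atom and
  # wraps each further atom as an application of the result so far.
  atoms.reduce { |fn, arg| "(#{fn} #{arg})" }
end

puts left_assoc('f x')      # (f x)
puts left_assoc('f x y')    # ((f x) y)
puts left_assoc('f x y z')  # (((f x) y) z)
```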

Can I 'unmatch' a rule programmatically in treetop?

Is it possible to skip a rule by validating it using Ruby code in Treetop?
Say there is something like this:
rule short_words
[a-z]+ {
def method1
text_value
end
...
}
end
And I want the word's size to be from 2 to 5 letters. Can I exit the rule if I find that the length of text_value is not between 2 and 5?
Treetop's syntax supports {min,max} bounds on matches. (Excerpt from http://treetop.rubyforge.org/syntactic_recognition.html)
Repetition count
A generalised repetition count (minimum, maximum) is also available.
* 'foo' 2.. matches 'foo' two or more times
* 'foo' 3..5 matches 'foo' from three to five times
* 'foo' ..4 matches 'foo' from zero to four times

Regular expression for not matching two underscores

I don't know whether it's really easy and I'm out of my mind....
In Ruby's regular expressions, how do I match strings which do not contain two consecutive underscores, i.e., "__"?
Ex:
Matches: "abcd", "ab_cd", "a_b_cd", "%*##_#+"
Does not match: "ab__cd", "a_b__cd"
-thanks
EDIT: I can't use reverse logic, i.e., checking for "__" strings and excluding them, since I need to use this with Ruby on Rails "validates_format_of()", which expects a regular expression that the value must match.
You could use negative lookahead:
^((?!__).)*$
The beginning-of-string ^ and end-of-string $ anchors are important: they force a check of "not followed by double underscore" at every position.
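Here is a small check of the lookahead approach against the strings from the question. It is a sketch with an illustrative constant name; because ^ and $ match at line boundaries in Ruby, it anchors with \A and \z instead, and it uses a non-capturing group.

```ruby
# Every position must not be the start of "__".
NO_DOUBLE_UNDERSCORE = /\A(?:(?!__).)*\z/

samples = ['abcd', 'ab_cd', 'a_b_cd', '%*##_#+', 'ab__cd', 'a_b__cd']
samples.each do |s|
  puts "#{s.inspect}: #{NO_DOUBLE_UNDERSCORE.match?(s)}"
end
# "abcd": true
# "ab_cd": true
# "a_b_cd": true
# "%*##_#+": true
# "ab__cd": false
# "a_b__cd": false
```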
/^([^_]*(_[^_])?)*_?$/
Tests:
regex=/^([^_]*(_[^_])?)*_?$/
# Matches
puts "abcd" =~ regex
puts "ab_cd" =~ regex
puts "a_b_cd" =~ regex
puts "%*##_#+" =~ regex
puts "_" =~ regex
puts "_a_" =~ regex
# Non-matches
puts "__" =~ regex
puts "ab__cd" =~ regex
puts "a_b__cd" =~ regex
But regex is overkill for this task. A simple string test is much easier:
puts 'a_b'.include?('__') #=> false
Would altering your logic still be valid?
You could check if the string contains two underscores with the regular expression [_]{2} and then just ignore it?
Negative lookahead
\b(?!\w*__\w*)\w+\b
Search for two consecutive underscores in the next word from the beginning of the word, and match that word if it is not found.
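Used with String#scan, this pattern picks out the words that contain no double underscore. A brief sketch, with made-up sample text and an illustrative constant name:

```ruby
# Words (runs of \w) that do not contain "__" anywhere.
CLEAN_WORD = /\b(?!\w*__\w*)\w+\b/

text = 'abcd ab__cd a_b_cd a_b__cd'
p text.scan(CLEAN_WORD)  # ["abcd", "a_b_cd"]
```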
Edit: To accommodate anything other than whitespaces in the match:
(?!\S*__\S*)\S+
If you wish to accommodate a subset of symbols, you can write something like the following, but then it will match _cd from a_b__cd among other things.
(?![a-zA-Z0-9_%*##+]*__[a-zA-Z0-9_%*##+]*)[a-zA-Z0-9_%*##+]+
