How to use lookbehind regexp in go? - ruby

I'm trying to convert this ruby regex to go.
GROUP_CALL = /^(?<i1>[ \t]*)group\(?[ \t]*(?<grps>#{SYMBOLS})[ \t]*\)?[ \t]+do[ \t]*?\n(?<blk>.*?)\n^\k<i1>end[ \t]*$/m
I converted it to
groupCall := regexp.MustCompile("^(?P<i1>[ \\t]*)group\\(?[ \\t]*(?P<grps>(?::\\w+|:\"[^\"#]+\"|:'[^']+')([ \\t]*,[ \\t]*(?::\\w+|:\"[^\"#]+\"|:'[^']+'))*)[ \\t]*\\)?[ \\t]+do[ \\t]*?\\n(?P<blk>.*?)\\n^\\k<i1>end[ \\t]*$/s")
but when run I get this error
error parsing regexp: invalid escape sequence: \k
There's no mention of \k in the go docs, is there no equivalent in go?

lookbehinds aren't supported neither are backreferences like #stribizhev mentioned.
Regular Expression 2 (RE2) Syntax Reference:
https://github.com/google/re2/wiki/Syntax
The syntax of the regular expressions accepted is the same general
syntax used by Perl, Python, and other languages. More precisely, it
is the syntax accepted by RE2 and described at
//code.google.com/p/re2/wiki/Syntax, except for \C. --GoLang Docs
(ref: https://golang.org/pkg/regexp/)

Related

Difference in Unicode behaviour between `\w` vs `[[:alpha:]]` in Ruby

(For this question, ignore the number and underscore matching of \w, which is irrelevant to the discussion here.)
According to the Oniguruma docs, both the shorthand character classes like \w and POSIX classes like [:alpha:] have similar behaviour with regard to Unicode: they have simple ascii behaviour for "Not Unicode Case" (I assume that means the string's encoding is not a Unicode one), and a different behaviour that uses Unicode properties for "Unicode Case".
From that documentation, it sounds as in a case where one of those uses Unicode properties, the other will also use them. However, in practice they seem to differ: the POSIX classes use Unicode properties automatically, whereas the \w type classes have to be explicitly marked with ?u to use Unicode property based matching:
$ ruby -e 'print("~café.".encoding)'
UTF-8
$ ruby -e 'print(/[[:alpha:]]+/.match("~café."))'
café
$ ruby -e 'print(/\w+/.match("~café."))'
caf
$ ruby -e 'print(/(?u)\w+/.match("~café."))'
café
$ ruby -v
ruby 2.3.6p384
Is this a bug, or is my interpretation of the docs wrong? (And what exactly does ?u do, could someone link to where it is documented?)
Since version 2.0, Ruby uses Onigmo, an Oniguruma fork that supports more features implemented in Perl 5.10.
If you compare the doc you linked (Oniguruma) with Onigmo's doc you can see a difference between the \w descriptions:
Oniguruma:
\w word character
Not Unicode:
alphanumeric, "_" and multibyte char.
Unicode: General_Category -- (Letter|Mark|Number|Connector_Punctuation)
Onigmo:
\w word character
Not Unicode:
alphanumeric and "_".
Unicode: General_Category -- (Letter|Mark|Number|Connector_Punctuation)
It depends on ONIG_OPTION_ASCII_RANGE option that non-ASCII char includes or not.
As you can see, there's no more this "and multibyte char." that doesn't make sense (at least for me) and that is probably a typo. Whatever, it's very unclear.
The u modifier switches the shorthand character classes from "Not Unicode" (default) to "Unicode".
That's why you obtain only caf without it and café with it when you try to match it using the character class \w.
On the other side the character class [[:alpha:]] seems to be already extended by default to unicode characters since it matches "café" without the u modifier. A start of explanation can be found in the doc:
It depends on ONIG_OPTION_ASCII_RANGE option and
ONIG_OPTION_POSIX_BRACKET_ALL_RANGE option that POSIX brackets
match non-ASCII char or not.
But you can force it to ascii using the (?a) modifier.

GSUB and Forward Slash usage in Ruby

I often see the gsub function being called with the pattern parameter enclosed in forward slashes. For example:
>> phrase = "*** and *** ran to the ###."
>> phrase.gsub(/\*\*\*/, "WOOF")
=> "WOOF and WOOF ran to the ###."
I thought maybe it had something to do with escaping asterisks, but using single quotes and double quotes works just as well:
>> phrase = "*** and *** ran to the ###."
>> phrase.gsub('***', "WOOF")
=> "WOOF and WOOF ran to the ###."
>> phrase.gsub("***", "WOOF")
=> "WOOF and WOOF ran to the ###."
Is it just convention to use forward slash? What am I missing?
Use forward slashes if you need to use regular expressions.
If you use a string argument with gsub, it will just do a plain character match.
In your example, you need backslashes to escape the asterisks when using a regular expression, because asterisks have a special meaning in regex (optionally match something any number of times). They are not necessary when using a string, because they are just matched exactly.
In your example, you probably don't need to use a regular expression, since it is a simple pattern. However, if you wanted to match *** only when it was at the beginning of a string (e.g. the first bunch in your example), then you would want to use a regex, for example:
phrase.gsub(/^\*{3}/, "WOOF")
For more information on regular expressions, see: http://www.regular-expressions.info/.
For more information on using regular expressions in Ruby, see: http://ruby-doc.org/core-2.2.0/Regexp.html.
To play with regular expressions as they work in Ruby, try: http://rubular.com/.
You are missing reading the documentation:
The pattern is typically a Regexp; if given as a String, any regular expression metacharacters it contains will be interpreted literally, e.g. '\d' will match a backlash followed by ‘d’, instead of a digit.
http://ruby-doc.org/core-2.1.4/String.html#method-i-gsub
In other words, you can give a string or a regular expression. Regular expressions can be delimited several ways:
Regexps are created using the /.../ and %r{...} literals, and by the Regexp::new constructor.
http://ruby-doc.org/core-2.2.2/Regexp.html
The benefit of %r and of the alternate %r delimiters is you can usually find a delimiter that doesn't collide with characters in the pattern, which would force escaping them, as in your example.
* has to be escaped because it has special meaning in a regex, but in a string it does not.

Loading Text File in Ruby in Windows

Aptana is returning:
Invalid escape character syntax
File.open("C:\Users\C*****\Documents\RubyProjects\text.txt
What do I do?
\ is an escape charecter in most languages, so the compiler expects an escaped char after it, in this case its also \, so you just need to use 2 of them
File.open("C:\\Users\\C*****\\Documents\\RubyProjects\\text.txt
Ruby doesn't need you to use reverse slashes. In your string
"C:\Users\C*****\Documents\RubyProjects\text.txt"
you're confusing Ruby because you have reverse-slashes, which denote escapes in a double-quoted (interpreted) string and make Ruby throw up. Instead use:
"C:/Users/C*****/Documents/RubyProjects/text.txt"
From the IO documentation:
Ruby will convert pathnames between different operating system conventions if possible. For instance, on a Windows system the filename "/gumby/ruby/test.rb" will be opened as "\gumby\ruby\test.rb". When specifying a Windows-style filename in a Ruby string, remember to escape the backslashes:
"c:\\gumby\\ruby\\test.rb"
Our examples here will use the Unix-style forward slashes; File::ALT_SEPARATOR can be used to get the platform-specific separator character.

Regular expression "empty range in char class error"

I got a regex in my code, which is to match pattern of url and threw error:
/^(http|https):\/\/([\w-]+\.)+[\w-]+([\w- .\/?%&=]*)?$/
The error was "empty range in char class error". I found the cause of that is in ([\w- .\/?%&=]*)? part. Ruby seems to recognize - in \w- . as an operator for range instead of a literal -. After adding escape to the dash, the problem was solved.
But the original regular expression ran well on my co-workers' machines. We use the same version of osx, rails and ruby: Ruby version is ruby 1.9.3p194, rails is 3.1.6 and osx is 10.7.5. And after we deployed code to our Heroku server, everything worked fine too. Why did only my environment have error regarding this regex? What is the mechanism of Ruby regex interpreting?
I can replicate this error on Ruby 1.9.3p194 (2012-04-20 revision 35410) [i686-linux], installed on Ubuntu 12.04.1 LTS using rvm 1.13.4. However, this should not be a version-specific error. In fact, I'm surprised it worked on the other machines at all.
A a simpler demonstration that fails just as well:
"abcd" =~ /[\w- ]/
This is because [\w- ] is interpreted as "a range beginning with any word character up to space (or blank)", rather than a character class containing a word, a hyphen, or a space, which is what you had intended.
Per Ruby's regular expression documentation:
Within a character class the hyphen (-) is a metacharacter denoting an inclusive range of characters. [abcd] is equivalent to [a-d]. A range can be followed by another range, so [abcdwxyz] is equivalent to [a-dw-z]. The order in which ranges or individual characters appear inside a character class is irrelevant.
As you saw, prepending a backslash escaped the hyphen, thus changing the nature of the regexp from a range to a character class, removing the error. However, escaping the hyphen in the middle of character class is not recommended, since it's easy to confuse the intended meaning of the hyphen in such cases. As m.buettner pointed out, always place hyphens either at the beginning or the end of a character class:
"abcd" =~ /[-\w ]/

How to use şŞıİçÇöÖüÜĞğ characters in a regular expression in Ruby?

I tried using a regular expression to capture names:
r[1].scan(/^([A-Z]|[ŞİÇÖÜĞ])([a-z]|[şŞıİçÇöÖüÜĞğ])*\s([A-Z]|[ŞİÇÖÜĞ])([a-z]|[şŞıİçÇöÖüÜĞğ])*/u)
But, it gives me an error:
syntax error, unexpected $end, expecting ')'
...atches = r[1].scan(/^([A-Z]|[ŞİÇÖÜĞ])([a-z]|[şŞ�...
...
I see that the problem is the Turkish characters I'm using. Is it possible to use unicode values of the characters in regexp? How can I use these problematic characters in this regexp?
Use ruby 1.9
Go with /\p{Word}+\p{Space}\p{Word}*/

Resources