Ruby regex ‘backslash R’ aka ‘\R’ pattern

I am pretty sure I have seen “\R was introduced in Ruby 2 to match newlines, regardless of where they came from: Unix \n, Mac OS \r or Windows \r\n” somewhere. That said, Ruby 2 should treat \R like %r{\r\n|\r|\n}.
This works fine:
▶ "a\nb".match /\R/
#⇒ #<MatchData "\n">
▶ "a\rb".match /\R/
#⇒ #<MatchData "\r">
▶ "a\r\nb".match /\R/
#⇒ #<MatchData "\r\n">
even when line endings are combined:
▶ "a\r\n\nb".match /\R{2}/
#⇒ #<MatchData "\r\n\n">
unless one tries to negate \R:
▶ "a\nb".match /[^\R]+/
#⇒ #<MatchData "a\nb">
Negating \n works fine though:
▶ "a\nb".match /[^\n]+/
#⇒ #<MatchData "a">
Unfortunately, \R is enormously hard to google. Neither Regexp rdoc nor Regular Expressions have a mention of it.
Would any regex guru drop an explanation here, so that it was at least easily googled?
Thanks in advance.

This is from the author: https://github.com/k-takata/Onigmo/blob/master/doc/RE#L101. It says
\R Linebreak
Unicode:
(?>\x0D\x0A|[\x0A-\x0D\x{85}\x{2028}\x{2029}])
Not Unicode:
(?>\x0D\x0A|[\x0A-\x0D])
What seems relevant to your question is that \R is not a character group but a list of alternatives. Given that the sequence is not necessarily a single character (\x0D\x0A is two), I guess it could not be made into a character group. This is probably interacting in a peculiar way with negation, which is intended to be used only with characters and/or character groups.
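A small sketch of a workaround, assuming the goal of [^\R]+ was "everything up to the next line break": list the line-break characters themselves inside the brackets and keep \R for positive matches. The variable names are only for illustration.

```ruby
# \R matches any line break and treats "\r\n" as a single unit.
line_breaks = ["a\nb", "a\rb", "a\r\nb"].map { |s| s[/\R/] }
# line_breaks => ["\n", "\r", "\r\n"]

# To match "anything that is not a line break", negate the individual
# characters in the brackets instead of \R itself.
first_line = "a\r\nb"[/[^\r\n]+/]
# first_line => "a"

# Splitting on \R copes with mixed line endings in one pass.
lines = "one\ntwo\r\nthree\rfour".split(/\R/)
# lines => ["one", "two", "three", "four"]
```

Note that [^\r\n] only covers the two most common characters; the Unicode list quoted above also includes \x0B, \x0C, \x{85}, \x{2028} and \x{2029}.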

Related

How to write \1 in string [closed]

I'm trying to write \1 in a string, but I can't do it. I would appreciate if somebody helped me with this strange behaviour. Here is an example with some explaining.
EDIT: Adding example output
puts "\1 <- null"
puts "\\1 <- slash one"
works!
but typing
"\1"
"\\1"
in the irb command line yields
"\1"
=> "\u0001"
"\\1"
=> "\\1"
There are a few ways to get it:
"\\1"
'\1'
?\\ + ?1
Remember that the way it will show up is always "\\1", which means literal backslash, one, which is what you want. The way to know that this is correct is to use puts:
puts "\\1"
# => \1
Inside of double-quoted strings, backslashes have significant meaning. \n means the newline character. In single quoted strings, that's two characters: backslash and n.
You can even test this:
"\\1".chars
# => ["\\", "1"]
'\1'.chars
# => ["\\", "1"]
So you can see Ruby is interpreting that as two characters, not three. Don't be fooled by the second backslash inside a double-quoted string. That's how a literal backslash is represented.
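To put the pieces side by side, here is a short sketch (variable names are just for illustration) comparing the three spellings and what inspect, which irb uses to echo results, makes of them:

```ruby
double = "\\1"   # double quotes: the backslash has to be escaped
single = '\1'    # single quotes: the backslash is already literal
octal  = "\1"    # unescaped in double quotes: an octal escape for U+0001

same      = (double == single)    # true, both are backslash + "1"
length    = double.length         # 2
inspected = double.inspect        # what irb echoes, with the backslash doubled
control   = (octal == "\u0001")   # true, a single control character
```

So the doubled backslash only ever appears in the source code and in the inspect output, never in the string itself.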
Have you tried puts '\1'? (single quotes instead of double)
I'm not 100% sure what you're asking but if that helps, cheers.
Your command line shows "\\1" because irb calls .inspect on the object, which escapes the string. So essentially \1 is properly stored, but when irb displays it, it adds another \ to indicate that the backslash is escaped.
When I'm in IRB and type "\1", the value returned is "\u0001", which is Ruby's way of
representing the character.
When I write puts("\1"), the behavior is the same in IRB and when running
a script. I see a Unicode character glyph as follows
0 0
0 1
This won't be the same output on all platforms (it depends on how Unicode is
displayed). So that's probably why you see no output on the repl.it example.

Regular Expressions metacharacters, \< and \>, no longer supported in latest Ruby version?

When I investigated this in irb, I found that the metacharacters \< and \> return nil when I expected a match. In the cheat sheet I'm using, these metacharacters are called "start-of-word" and "end-of-word" respectively. But don't they function the same as "word boundaries"?
It seems to hold true for the examples in "Mastering Regular Expressions" by Friedl.
irb(main):001:0> "this cat is fat" =~ /\bcat\b/
=> 5
irb(main):002:0> "this cat is fat" =~ /\<cat\>/
=> nil
irb(main):003:0> "cat" =~ /\<cat\>/
=> nil
That's entirely possible. As of Ruby 1.9, Ruby switched to Oniguruma for regular expression parsing. It's possible that prior to 1.9, \< and \> were valid.
However, in researching this, I found that they are listed as a specifically GNU addition to the regex language.
Playing with it in Rubular, which supports running a regex through several different Ruby implementations, I couldn't get \< or \> to work in any version. \b seems to be the more standard way of specifying a word boundary...
Ruby has always used \b for word boundaries, just like Perl and the other Perl-derived flavors (JavaScript, .NET, Python, etc.). It's only GNU tools like egrep that use \< and \>.
Is that the AddedBytes cheat sheet you're using? It contains several errors, that being one of them. Elsewhere it says < and > are metacharacters and you have to use \< and \> to match them literally; how's that for a Catch-22?
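If you really need separate start-of-word and end-of-word assertions, one possible sketch is to combine \b with a lookahead or lookbehind. The constant names below are hypothetical stand-ins for GNU's \< and \>, not anything built into Ruby:

```ruby
str = "this cat is fat"

# \b matches at either edge of a word.
both_edges = str =~ /\bcat\b/                          # => 5

# Hypothetical stand-ins for GNU's \< and \>: a boundary that is followed
# by a word character (start) or preceded by one (end).
START_OF_WORD = /\b(?=\w)/
END_OF_WORD   = /\b(?<=\w)/
emulated = str =~ /#{START_OF_WORD}cat#{END_OF_WORD}/  # => 5

# No match when "cat" is only part of a longer word.
inside_word = "concatenate" =~ /\bcat\b/               # => nil

# In Ruby, \< and \> are just escaped literal angle brackets.
literal = "<cat>" =~ /\<cat\>/                         # => 0
```

The last line also explains the nil results in the question: the pattern was looking for the literal text <cat>.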

Ruby Regex Different from Rubular

I'm trying the same super simple regex on Rubular.com and on my Linux VM with Ruby 1.9.2. I don't know why I'm getting different outputs:
VM:
my_str = "Madam Anita"
puts my_str[/\w/]
this Outputs: Madam
on Rubular it outputs: MadamAnita
Rubular:
http://www.rubular.com/r/qyQipItdes
I would love some help. I'm stuck here. I will not be able to test my code for hw1.
No, it doesn't really. It matches all characters in "Madam" and "Anita", but not the space. The problem you are having is that my_str[/\w/] only returns a single match for the given regular expression, whereas Rubular highlights all possible matches.
If you need all occurrences, you could do this:
1.9.3p194 :002 > "Madam Anita".scan(/\w+/)
=> ["Madam", "Anita"]
Actually, \w matches a single character. The result in Rubular contains spaces between adjacent characters to tell you this (though I wish they'd also make the highlighting more obvious...). Compare with the output from matching \w+, which matches two strings (Madam and Anita).
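A quick sketch of the difference, using the string from the question (variable names are only illustrative): String#[] with a regex returns just the first match, while String#scan collects all of them.

```ruby
my_str = "Madam Anita"

first_char = my_str[/\w/]        # => "M"      (first match, a single character)
first_word = my_str[/\w+/]       # => "Madam"  (first match, a run of characters)
all_chars  = my_str.scan(/\w/)   # one array entry per word character
all_words  = my_str.scan(/\w+/)  # => ["Madam", "Anita"]
```

Rubular behaves like scan: it highlights every match, which is why its output looks longer than what puts prints.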

Ruby: hexadecimal in regular expressions

I need to match an md5 checksum in a regular expression in a Ruby (actually Rails) program. I found out somewhere that I can match hexadecimal strings with \h sequence, but I can't find the link anymore.
I'm using that sequence and my code is working in Ruby 1.9.2. I can make it work even under plain IRB (so it's not a Rails extension).
ruby-1.9.2-p180 :007 > "123abcdf" =~ /^\h+$/; $~
=> #<MatchData "123abcdf">
ruby-1.9.2-p180 :008 > "123abcdfg" =~ /^\h+$/; $~
=> nil
However, my IDE marks that expression as wrong and I can't find any reference which cites that sequence.
Is the \h sequence legal in Ruby Regex under any environment/version or should I trust my ide and replace it with something like [abcdef\d]?
Yes it is. Check the official doc for the complete documentation for regex in Ruby.
Note that \h will match uppercase letters too, so it's actually equivalent to [a-fA-F\d]
According to this, \h is part of Oniguruma, which I believe is standard in Ruby 1.9.
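For the md5 case specifically, a minimal sketch might look like this. MD5_HEX is a made-up constant name, and match? assumes Ruby 2.4 or later (on older versions =~ works the same way). It uses \A and \z rather than ^ and $, because in Ruby ^ and $ anchor to lines, not to the whole string.

```ruby
# Hypothetical constant: exactly 32 hex digits, anchored to the whole string.
MD5_HEX = /\A\h{32}\z/

valid     = MD5_HEX.match?("d41d8cd98f00b204e9800998ecf8427e")  # true
uppercase = MD5_HEX.match?("D41D8CD98F00B204E9800998ECF8427E")  # true, \h accepts A-F too
too_short = MD5_HEX.match?("123abcdf")                          # false
non_hex   = MD5_HEX.match?("g" * 32)                            # false
```

If you only want lowercase digests, [0-9a-f]{32} is the stricter choice.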

Ruby RegEx problem text.gsub[^\W-], '') fails

I'm trying to learn RegEx in Ruby, based on what I'm reading in "The Rails Way". But, even this simple example has me stumped. I can't tell if it is a typo or not:
text.gsub(/\s/, "-").gsub([^\W-], '').downcase
It seems to me that this would replace all spaces with -, then anywhere a string starts with a non letter or number followed by a dash, replace that with ''. But, using irb, it fails first on ^:
syntax error, unexpected '^', expecting ']'
If I take out the ^, it fails again on the W.
>> text = "I love spaces"
=> "I love spaces"
>> text.gsub(/\s/, "-").gsub(/[^\W-]/, '').downcase
=> "--"
Missing //
Although this makes a little more sense :-)
>> text.gsub(/\s/, "-").gsub(/([^\W-])/, '\1').downcase
=> "i-love-spaces"
And this is probably what is meant
>> text.gsub(/\s/, "-").gsub(/[^\w-]/, '').downcase
=> "i-love-spaces"
\W means "a non-word character"
\w means "a word character"
The slashes // create a Regexp object
/[^\W-]/.class
=> Regexp
Step 1: Add this to your bookmarks. Whenever I need to look up regexes, it's my first stop
Step 2: Let's walk through your code
text.gsub(/\s/, "-")
You're calling the gsub function, and giving it 2 parameters.
The first parameter is /\s/, which is Ruby for "create a new regexp containing \s" (the // are like special "" for regexes).
The second parameter is the string "-".
This will therefore replace all whitespace characters with hyphens. So far, so good.
.gsub([^\W-], '').downcase
Next you call gsub again, passing it 2 parameters.
The first parameter is [^\W-]. Because we didn't wrap it in forward slashes, Ruby will literally try to run that code. [] creates an array, then it tries to put ^\W- into the array, which is not valid code, so it breaks.
Changing it to /[^\W-]/ gives us a valid regex.
Looking at the regex, the [^ ] says 'match any character not in this group'. The group contains \W (which means non-word character) and -, so the regex matches any character that is neither a non-word character nor a hyphen; in other words, any word character.
As the second thing you pass to gsub is an empty string, it ends up replacing all the word characters with an empty string (thereby stripping them out and leaving only the hyphens). To strip everything except word characters and hyphens, use a lowercase \w: /[^\w-]/.
.downcase
Which just converts the string to lower case.
Hope this helps :-)
You forgot the slashes. It should be /[^\W-]/
Well, .gsub(/[^\W-]/,'') says replace anything that is neither a non-word character nor a - with nothing, which removes every word character.
You probably want
>> text.gsub(/\s/, "-").gsub(/[^\w-]/, '').downcase
=> "i-love-spaces"
Lower case \w (\W is just the opposite)
The slashes are to say that the thing between them is a regular expression, much like quotes say the thing between them is a string.
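Putting the corrected chain into a method makes it easy to try out. This is only a sketch; slugify is a hypothetical name, not something from "The Rails Way":

```ruby
# Hypothetical helper wrapping the corrected chain from the answers above.
def slugify(text)
  text.gsub(/\s/, "-")      # every whitespace character becomes a hyphen
      .gsub(/[^\w-]/, "")   # drop anything that is not a word character or hyphen
      .downcase
end

slug  = slugify("I love spaces")   # => "i-love-spaces"
punct = slugify("Hello, World!")   # => "hello-world"
```

With the uppercase \W from the book's version, the same input would come back as "--", as shown in the irb session above.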
