Regular expression back referencing - Ruby

Why does this snippet:
'He said "Hello"' =~ /(\w)\1/
match "ll"? I thought that the \w part matches "H", so \1 refers to "H", and therefore nothing should match. Why do I get this result?

I thought that the \w part matches "H"
\w matches any alphanumeric character (and the underscore). It does match H, but that is not terribly interesting, because the regular expression then says that the same character has to appear again immediately afterwards. H can't satisfy that in your text (it doesn't appear twice consecutively), and neither can any other character except l. So the regular expression matches ll.

You're thinking of /^(\w)\1/. The caret symbol specifies that the match must start at the beginning of the line. Without that, the match can start anywhere in the string (it will find the first match).

And you're right, nothing was matched at that position. The regex engine then moved further along the string and found a match, which it returned to you.
\w of course matches any word character, not just 'H'.

The point is, "\1" matches the same text that the "(\w)" group has just captured. Only the letter "l" is doubled in your string, so only "ll" matches your regex.
A nice page for toying around with Ruby and regular expressions is Rubular.
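To see the behaviour described in the answers above, here is a small sketch that can be pasted into irb. The variable names are only for illustration; the string and both patterns come from the question and the answers.

```ruby
text = 'He said "Hello"'

# Unanchored: each start position is tried in turn until (\w)\1 fits.
unanchored = text.match(/(\w)\1/)
index      = text =~ /(\w)\1/     # 0-based offset where the match starts

# Anchored with ^: only the start of a line is tried, and "He" is not
# a doubled character, so there is no match at all.
anchored = text.match(/^(\w)\1/)

p unanchored[0]  # "ll"  (the whole match)
p unanchored[1]  # "l"   (what group 1 captured, and what \1 had to repeat)
p index          # 11
p anchored       # nil
```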

Related

What does "1\/1." mean in Ruby?

I am learning Ruby and I have something to match with (/^1\/1. Guess a word from an anagram [RUBY]{4}$/)
What does "1\/1." mean in this expression? Can anyone explain what's going on?
Thanks
Generally speaking, a backslash in a regular expression escapes the next character, so that it's treated as an ordinary character rather than whatever its special meaning would be. For instance a* matches zero or more of the letter a, but a\* matches, literally, an a followed by a star. Since most regular expressions in Ruby are wrapped in the delimiter /, we can't directly put forward slashes in our regex. If we had written
/^1/1. Guess a word from an anagram [RUBY]{4}$/
Then the regex would be /^1/ and the rest of the line would be a very confusing syntax error. This is for the same reasons that we can't put " characters directly inside of a "-delimited string.
So the backslash makes Ruby treat the slash as a literal character in the expression rather than as a delimiter.
/^1\/1. Guess a word from an anagram [RUBY]{4}$/
This literally matches a 1, followed by a slash, followed by a 1, at the start of the line. The unescaped . after that matches any single character.
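As a quick check, the sketch below matches a sample line against the escaped form and against an equivalent %r{...} literal, which uses braces as delimiters so the slash needs no backslash. The sample strings are made up for illustration.

```ruby
line = "1/1. Guess a word from an anagram BURY"  # hypothetical input

escaped   = /^1\/1. Guess a word from an anagram [RUBY]{4}$/
# With %r{...} the slash is not a delimiter, so it is written as-is.
percent_r = %r{^1/1. Guess a word from an anagram [RUBY]{4}$}

slash_match   = line =~ escaped
percent_match = line =~ percent_r

# The unescaped dot accepts any character in that position, not only ".".
dot_match = "1/1! Guess a word from an anagram BURY" =~ escaped

# ^ anchors to the start of the line, so a leading character prevents a match.
shifted = "x1/1. Guess a word from an anagram BURY" =~ escaped

p slash_match    # 0
p percent_match  # 0
p dot_match      # 0
p shifted        # nil
```

If the dot is meant to be a literal period, it can be escaped as \. in either form.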

How do I tune this regex to return the matches I want?

So I have a string that looks like this:
#jackie#test.com, #mike#test.com
What I want to do is before any email in this comma separated list, I want to remove the #. The issue I keep running into is that if I try to do a regular \A flag like so /[\A#]+/, it finds all the instances of # in that string...including the middle crucial #.
The same thing happens if I do /[\s#]+/. I can't figure out how to just look at the beginning of each string, where each string is a complete email address.
Edit 1
Note that all I need is the regex, I already have the rest of the stuff I need to do what I want. Specifically, I am achieving everything else like this:
str.gsub(/#/, '').split(',').map(&:strip)
Where str is my string.
All I am looking for is the regex portion for my gsub.
You may use the below negative lookbehind based regex.
str.gsub(/(?<!\S)#/, '').split(',').map(&:strip)
(?<!\S) is a negative lookbehind. It asserts that the character we are about to match is not preceded by a non-whitespace character. So this matches a # at the start of the string or a # that directly follows a whitespace character.
The difference between my answer and hwnd's str.gsub(/\B#/, '') is that mine won't match the # in :# but hwnd's will. \B matches between two word characters or between two non-word characters.
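A small sketch comparing the two patterns from this answer. The ":#..." input is a made-up case, only there to show where they differ.

```ruby
str = "#jackie#test.com, #mike#test.com"

lookbehind   = /(?<!\S)#/  # '#' not preceded by a non-whitespace character
non_boundary = /\B#/       # '#' at a position that is not a word boundary

emails_lookbehind   = str.gsub(lookbehind, '').split(',').map(&:strip)
emails_non_boundary = str.gsub(non_boundary, '').split(',').map(&:strip)

# Hypothetical input where '#' follows another non-word character.
tricky  = ":#mike#test.com"
kept    = tricky.gsub(lookbehind, '')    # ':' is non-whitespace, so '#' stays
removed = tricky.gsub(non_boundary, '')  # ':' and '#' are both non-word, so \B matches

p emails_lookbehind    # ["jackie#test.com", "mike#test.com"]
p emails_non_boundary  # ["jackie#test.com", "mike#test.com"]
p kept                 # ":#mike#test.com"
p removed              # ":mike#test.com"
```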
Here is one solution
str = "#jackie#test.com, #mike#test.com"
p str.split(/,[ ]+/).map{ |i| i.gsub(/^#/, '')}
Output
["jackie#test.com", "mike#test.com"]

Regular expression help to skip first occurrence of a special character while allowing for later special chars but no whitespace

I'm looking for words starting with a hashtag: "#yolo"
My regex for this was very simple: /#\w+/
This worked fine until I hit words that ended with a question mark: "#yolo?".
I updated my regex to allow for words and any non whitespace character as well: /#[\w\S]*/.
The problem is I sometimes need to pull a match from a word starting with two '#' characters, up until whitespace, that may contain a special character in it or at the end of the word (which I need to capture).
Example:
"##yolo?"
And I would like to end up with:
"#yolo?"
Note: the regular expressions are for Ruby.
P.S. I'm testing these out here: http://rubular.com/
Maybe this would work
#(#?[\S]+)
What about
#[^#\s]+
\w is a subset of \S (anything that is not whitespace), so you don't need both in the character class. Also, I assume you don't want any more #s in the match, so we use [^#\s], which excludes both whitespace and # characters.
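Here is a short sketch of how #[^#\s]+ behaves on the inputs from the question. The longer sentence is an invented example.

```ruby
pattern = /#[^#\s]+/

single = "#yolo?"[pattern]
# At the first '#' the class [^#\s] refuses the second '#', so the engine
# moves on and the match starts at the second '#'.
double = "##yolo?"[pattern]

# Hypothetical sentence with several tags.
all = "first #yolo? then ##swag! done".scan(pattern)

p single  # "#yolo?"
p double  # "#yolo?"
p all     # ["#yolo?", "#swag!"]
```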

Why does having a literal space between regex tokens lead to different matchdata objects?

For example, consider the following expressions:
no_space = "This is a test".match(/(\w+)(\w+)/)
with_space = "This is a test".match(/(\w+) (\w+)/)
The expression no_space is now the matchdata object #<MatchData "This" 1:"Thi" 2:"s">, while with_space is #<MatchData "This is" 1:"This" 2:"is">. What is going on here? It seems to me like the literal space between tokens indicates to ruby that it should match multiple words if possible, while not having a space causes the match to be limited to one word. Any explanation or clarification on the subject would be appreciated.
Thanks.
\w doesn't match a space, and + is greedy unless you follow it with ?, so Ruby tries to match as many \w as possible, as long as the rest of the expression also matches. The second group needs at least one character, so the first capture ends up with Thi and the second with s.
When you add a space, Ruby matches as many \w as it can up to a space character, then the space, and then as many \w as it can, therefore matching This and is.
Please let me know if this isn't clear.
With the regular expression /(\w+)(\w+)/, the only characters that can be matched are word characters (letters, digits, and underscores). A regular expression will only ever match consecutive characters in a string, so unless you include something in the regular expression to match the spaces between words the regex can't match more than a single word.
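The two cases from the question, plus a lazy variant for comparison (the lazy pattern is not from the question, it only illustrates the "unless you follow it with ?" remark):

```ruby
text = "This is a test"

greedy = text.match(/(\w+)(\w+)/)   # first group takes all it can, then gives back one character
spaced = text.match(/(\w+) (\w+)/)  # the literal space lets the match span two words
lazy   = text.match(/(\w+?)(\w+)/)  # +? makes the first group take as little as possible

p greedy.captures  # ["Thi", "s"]
p spaced.captures  # ["This", "is"]
p lazy.captures    # ["T", "his"]
```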

Dot operator in negative bracket expression

The Ruby in Tim Bray's Wide Finder benchmark (http://wikis.sun.com/display/WideFinder/The+Benchmark) has this line:
%r{GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) }
I've been using regexes for a long time, but I'm not sure what the point of the "." is. It seems to match on anything that's not a space, but [^ ] would do that anyway.
When I first looked at it, it looked to me like it would match on nothing except possibly a line break.
Can anybody explain the behavior of this expression?
[^ .] means match any single character apart from a space or a literal period. The period does not have a special meaning when inside square brackets, so [^ .]+ stops at the first space or period, whichever comes first.
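A rough sketch of the effect, using two invented request lines (they are assumptions, not taken from the benchmark data). Because the class must be followed directly by a space, a path containing a period does not match at all.

```ruby
pattern = %r{GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) }

# Hypothetical log fragments.
article = "GET /ongoing/When/200x/2006/03/19/Some-Post HTTP/1.1"
image   = "GET /ongoing/When/200x/2006/03/19/picture.png HTTP/1.1"

article_hit = article[pattern, 1]
image_hit   = image[pattern, 1]   # "picture" is followed by ".", not by a space

# Inside brackets the dot is just a dot.
pieces = "a.b c".scan(/[^ .]+/)

p article_hit  # "2006/03/19/Some-Post"
p image_hit    # nil
p pieces       # ["a", "b", "c"]
```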

Resources