What does `(?:| ...)` mean in a Ruby regular expression? - ruby

While reading "Engineering Long-Lasting Software: An Agile Approach Using SaaS and Cloud Computing" I came across the following regex (Chapter 5, Section 5.3, "Introducing Cucumber and Capybara"):
/^(?:|I )am on (.+)$/
I know about the non-capturing (?: ...) syntax, but what I don’t understand is the meaning of the first pipe character after the colon. Is it a typo? Does it serve any particular purpose?

The pipe in regex means alternative. In this case, it is expressing alternation between an empty string "" and the string "I ".

It is just an "or": the group matches either nothing or "I " (with a trailing space). The rest is a non-capturing group, as you mention.
The regex matches both "I am on a diet" and "am on a diet", and in each case it captures "a diet" in the first group.
Try it out on Rubular - http://rubular.com/r/q3RFEoxj1e
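For readers without access to Rubular, the same check can be run in irb. This is only a small sketch; the sample sentences are the ones from the answer above plus one made-up non-matching line, and the variable names are illustrative.

```ruby
# Step pattern quoted in the question.
step = /^(?:|I )am on (.+)$/

with_i    = step.match("I am on a diet")
without_i = step.match("am on a diet")
no_match  = step.match("You am on a diet") # made-up sentence that should not match

# The (?:...) group is not numbered, so the page name is always group 1.
puts with_i[1]    # a diet
puts without_i[1] # a diet
p no_match        # nil
```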

(?:|something)
("nothing / empty string or the match")
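Is, for practical purposes, the same thing as (strictly speaking the empty alternative is tried first, which corresponds to the lazy form (?:something)??):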
(?:something)?
("the match, once or none")
In other words: the non-capturing subpattern is optional.
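A quick way to convince yourself is to run both spellings over the same inputs. The sketch below assumes the anchored step pattern from the question; the sample step texts are made up for illustration.

```ruby
empty_alternative = /^(?:|I )am on (.+)$/
optional_group    = /^(?:I )?am on (.+)$/

steps = ["I am on the home page", "am on the home page", "We am on the home page"]

# For each step, collect what each pattern captures (nil when it does not match).
results = steps.map do |text|
  a = empty_alternative.match(text)
  b = optional_group.match(text)
  [a && a[1], b && b[1]]
end

results.each { |a, b| puts "#{a.inspect} / #{b.inspect}" }
# "the home page" / "the home page"
# "the home page" / "the home page"
# nil / nil
```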

Related

Ruby regex - gsub only captured group

I'm not quite sure I understand how non-capturing groups work. I am looking for a regex to produce this result: 5.214. I thought the regex below would work, but it is replacing everything including the non-capture groups. How can I write a regex to only replace the capture groups?
"5,214".gsub(/(?:\d)(,)(?:\d)/, '.')
# => ".14"
My desired result:
"5,214".gsub(some_regex)
#=> "5.214"
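Non-capturing groups still consume the characters they match, so those characters are part of what gets replaced.
Either put the digits back with backreferences:
"5,214".gsub(/(\d+)(,)(\d+)/, '\1.\3')
or use lookarounds (Ruby does not accept a variable-length look-behind such as (?<=\d+), so a single \d is used):
"5,214".gsub(/(?<=\d)(,)(?=\d)/, '.')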
You can't. gsub replaces the entire match; with a plain replacement string such as '.', it does not do anything with the captured groups, so it makes no difference whether the groups capture or not.
In order to achieve the result, you need to use lookbehind and lookahead.
"5,214".gsub(/(?<=\d),(?=\d)/, '.')
It is also possible to use Regexp.last_match (also available via $~) in the block version to get access to the full MatchData:
"5,214".gsub(/(\d),(\d)/) { |_|
  match = Regexp.last_match
  "#{match[1]}.#{match[2]}"
}
This scales better to more involved use-cases.
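As a sketch of what "more involved" might look like, the block form below handles several numbers in one string. The input text is made up for illustration; only gsub and Regexp.last_match come from the answer.

```ruby
text = "5,214 and 10,000" # illustrative input, not from the question

result = text.gsub(/(\d),(\d)/) do
  m = Regexp.last_match # MatchData for the match currently being replaced
  "#{m[1]}.#{m[2]}"     # put the surrounding digits back around a dot
end

puts result # 5.214 and 10.000
```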
Nota bene, from the Ruby docs:
the ::last_match is local to the thread and method scope of the method that did the pattern match.
gsub replaces the entire match the regular expression engine produces. Neither capturing nor non-capturing groups change that: the characters they match are still replaced. However, you can use lookaround assertions, which do not "consume" any characters of the string.
"5,214".gsub(/\d\K,(?=\d)/, '.')
Explanation: The \K escape sequence resets the starting point of the reported match and any previously consumed characters are no longer included. That being said, we then look for and match the comma, and the Positive Lookahead asserts that a digit follows.
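The two approaches can be compared side by side. This is a minimal sketch using the question's input; the variable names are illustrative.

```ruby
# Look-behind and lookahead: only the comma is part of the match.
with_lookbehind = "5,214".gsub(/(?<=\d),(?=\d)/, ".")

# \K: the digit is consumed, then dropped from the reported match.
with_keep = "5,214".gsub(/\d\K,(?=\d)/, ".")

# What the engine reports as the match when \K is used.
reported = "5,214".match(/\d\K,(?=\d)/)[0]

puts with_lookbehind  # 5.214
puts with_keep        # 5.214
puts reported.inspect # ","
```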
I know nothing about Ruby, but from what I see in the tutorial, gsub means "replace".
The pattern should be /(?<=\d),(?=\d)/, which just replaces the comma with a dot.
Or use captures: match /(\d+),(\d+)/ and replace the string with '\1.\2'.
You can easily reference capture groups in the replacement string (second argument) like so:
"5,214".gsub(/(\d+)(,)(\d+)/, '\1.\3')
#=> "5.214"
\0 will be replaced by the whole matched string.
\1 will be replaced by the first capturing group.
\2 will be replaced by the second capturing group etc.
You could rewrite the example above using a non-capturing group for the , char.
"5,214".gsub(/(\d+)(?:,)(\d+)/, '\1.\2')
#=> "5.214"
As you can see, the part after the comma is now the second capturing group, since we defined the middle group as non-capturing.
That is kind of pointless in this case, though: you can just omit the group around , altogether.
"5,214".gsub(/(\d+),(\d+)/, '\1.\2')
#=> "5.214"
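The numbering rule can be checked directly, and Ruby's replacement strings also accept named backreferences via \k<name>. In the sketch below the group names whole and rest are invented for illustration.

```ruby
# All groups capturing: the digits after the comma are group 3.
numbered = "5,214".gsub(/(\d+)(,)(\d+)/, '\1.\3')

# Middle group non-capturing: the digits after the comma become group 2.
renumbered = "5,214".gsub(/(\d+)(?:,)(\d+)/, '\1.\2')

# Named groups avoid counting altogether (names are hypothetical).
named = "5,214".gsub(/(?<whole>\d+),(?<rest>\d+)/, '\k<whole>.\k<rest>')

puts numbered   # 5.214
puts renumbered # 5.214
puts named      # 5.214
```

Note the single quotes around the replacement: in a double-quoted string the backslash would need to be doubled ("\\1.\\3").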
You don't need regexp to achieve what you need:
'1,200.00'.tr('.','!').tr(',','.').tr('!', ',')
Periods become bangs (1,200!00)
Commas become periods (1.200!00)
Bangs become commas (1.200,00)
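String#tr can also map several characters at once, so the temporary bang is not strictly needed. The following sketch compares the three-step version above with a single call; the variable names are illustrative.

```ruby
amount = '1,200.00'

# Three passes with a temporary placeholder, as in the answer.
three_step = amount.tr('.', '!').tr(',', '.').tr('!', ',')

# One pass: '.' maps to ',' and ',' maps to '.' simultaneously.
one_step = amount.tr('.,', ',.')

puts three_step # 1.200,00
puts one_step   # 1.200,00
```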

Why won't a longer token in an alternation be matched?

I am using Ruby 2.1, but the same thing can be reproduced on the Rubular site.
If this is my string:
儘管中國婦幼衛生監測辦公室制定的
And I do a regex match with this expression:
(中國婦幼衛生監測辦公室制定|管中)
I am expecting to get the longer token as a match.
中國婦幼衛生監測辦公室制定
Instead I get the second alternation as a match.
As far as I know, the longer token is matched when the text is not in Chinese characters.
If this is my string:
foobar
And I use this regex:
(foobar|foo)
The returned match is foobar. If the alternatives are in the other order, then the match is foo. That makes sense to me.
Your assumption that regex matches a longer alternation is incorrect.
If you have a bit of time, let's look at how your regex works...
Quick refresher: How regex works: The state machine always reads from left to right, backtracking where necessary.
There are two pointers, one on the Pattern:
(cdefghijkl|bcd)
The other on your String:
abcdefghijklmnopqrstuvw
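The pointer on the String moves from the left, and the engine returns as soon as any alternative matches at the current position.
At position 0 (a), neither cdefghijkl nor bcd can start, so the String pointer advances.
At position 1 (b), the first alternative fails immediately because it needs a c, but the second alternative bcd matches.
The engine therefore returns bcd without ever testing position 2, where the longer cdefghijkl would have started.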
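The same behaviour can be observed in irb. This sketch uses the Latin stand-in string from this answer and the original Chinese string from the question; the variable names are illustrative.

```ruby
string = "abcdefghijklmnopqrstuvw"
m = string.match(/(cdefghijkl|bcd)/)

puts m[0]       # bcd
puts m.begin(0) # 1, the match starts at "b", one position before "c"

# The question's data behaves the same way: 管中 starts one character
# earlier than 中國..., so it is found first.
chinese = "儘管中國婦幼衛生監測辦公室制定的"
found = chinese[/(中國婦幼衛生監測辦公室制定|管中)/]
puts found      # 管中
```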
Your foobar example is a different topic. As I mentioned in this post:
How regex works: The state machine always reads from left to right. ,|,, is equivalent to , because only the first alternation will ever be matched.
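When both alternatives can start at the same position, the order in which they are written decides. A small sketch with the foobar example from the question and the comma example quoted above:

```ruby
longer_first  = "foobar"[/(foobar|foo)/]
shorter_first = "foobar"[/(foo|foobar)/]

puts longer_first  # foobar
puts shorter_first # foo

# With , listed first, the ,, alternative is never reached.
single_first = "a,,b".scan(/,|,,/)
double_first = "a,,b".scan(/,,|,/)

p single_first # [",", ","]
p double_first # [",,"]
```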
    That's good, Unihedron, but how do I force it to the first alternation?
Look!*
^(?:.*?\Kcdefghijkl|.*?\Kbcd)
Here is a regex demo.
This regex first attempts to match the entire string with the first alternation. Only if that fails completely will it attempt the second alternation. \K discards everything matched before it from the reported match, so only the text after \K is returned.
*: \K has been supported in Ruby since 2.0.0.
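Here is a sketch of the prioritised pattern in Ruby (2.0 or later, because of \K). The second input string is made up to show the fallback to the lower-priority alternative.

```ruby
prioritised = /^(?:.*?\Kcdefghijkl|.*?\Kbcd)/

# The preferred alternative wins even though bcd starts earlier in the string.
preferred = "abcdefghijklmnopqrstuvw".match(prioritised)[0]

# Only when the preferred token is absent is the second alternative used.
fallback = "abcdxyz".match(prioritised)[0] # illustrative input

# Applied to the question's data.
chinese = "儘管中國婦幼衛生監測辦公室制定的"
long_token = chinese.match(/^(?:.*?\K中國婦幼衛生監測辦公室制定|.*?\K管中)/)[0]

puts preferred  # cdefghijkl
puts fallback   # bcd
puts long_token # 中國婦幼衛生監測辦公室制定
```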
Read more:
The Stack Overflow Regex Reference
On greedy vs non-greedy
Ah, I was bored, so I optimized the regex:
^(?:(?:(?!cdefghijkl)c?[^c]*)++\Kcdefghijkl|(?:(?!bcd)b?[^b]*)++\Kbcd)
You can see a demo here.

Avoid combination of hyphen and space using a regex

I'm currently writing a very specific regex for a firstname field, that has several requirements. One of them is that spaces are not allowed before or after hyphens. For this, I have used a negative lookahead:
(?!.*(\s\-))
as part of the regex:
^(?!ß)(?!.*(\s\-))(?!(.)\1{2})(?!.*\s{2})(?!.*\'{2})(?!.*\-{2})[a-zA-ZßöüäÜÖÄ\s\-\']{2,30}(?<![\s\-])$
It does return a mismatch for:
asdf -asdf
but not for:
asdf- asdf
The latter also needs to return an error. What am I missing?
You also have to assert that the other combination, a hyphen followed by whitespace, is absent from your string:
(?!.*(\s\-))(?!.*(\-\s))
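To see the two lookaheads in isolation, the sketch below tests only that guard, without the rest of the first-name pattern. The last two sample names are made up for illustration.

```ruby
# Only the two negative lookaheads, anchored at the start of the line.
guard = /^(?!.*(\s\-))(?!.*(\-\s))/

samples = ["asdf -asdf", "asdf- asdf", "asdf-asdf", "asdf asdf"]
verdicts = samples.map { |name| guard.match?(name) }

samples.zip(verdicts).each { |name, ok| puts "#{name.inspect}: #{ok}" }
# "asdf -asdf": false
# "asdf- asdf": false
# "asdf-asdf": true
# "asdf asdf": true
```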
You can also rewrite your pattern in a simpler way that avoids many of these problems and is more efficient, for example:
^(?=.{2,30}$)(?!(.)\1{2})[a-zA-ZöüäÜÖÄ]+(?:[-'\s][a-zA-ZßöüäÜÖÄ]+)*$
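A few sample names run against the simplified pattern, as a rough sanity check. All inputs other than the two from the question are invented for illustration.

```ruby
name = /^(?=.{2,30}$)(?!(.)\1{2})[a-zA-ZöüäÜÖÄ]+(?:[-'\s][a-zA-ZßöüäÜÖÄ]+)*$/

samples = ["Anne-Marie", "O'Neil", "asdf -asdf", "asdf- asdf", "ßeta", "aaab"]
verdicts = samples.map { |s| [s, name.match?(s)] }.to_h

verdicts.each { |s, ok| puts "#{s.inspect}: #{ok}" }
# "Anne-Marie": true
# "O'Neil": true
# "asdf -asdf": false   (separator followed by a separator)
# "asdf- asdf": false
# "ßeta": false         (leading ß is not in the first character class)
# "aaab": false         (same character three times at the start)
```

Because every separator must be followed by at least one letter, doubled separators and trailing separators are rejected without extra lookaheads.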
Simplest is probably a negative lookahead right after the ^:
/^(?!.*(\s-|-\s))#{main_pattern}/
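In runnable form this might look as follows. The value of main_pattern here is a deliberately simplified, hypothetical stand-in for the rest of the first-name regex, not the asker's full pattern.

```ruby
# Hypothetical stand-in for the remainder of the first-name pattern.
main_pattern = /[a-zA-Z\s'-]{2,30}$/

# A single lookahead with an alternation covers both forbidden combinations.
first_name = /^(?!.*(\s-|-\s))#{main_pattern}/

samples = ["asdf -asdf", "asdf- asdf", "Anne-Marie", "Anne Marie"]
verdicts = samples.map { |s| first_name.match?(s) }

p verdicts # [false, false, true, true]
```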

regex: question mark followed by colon as an alternative

In Cucumber for Rails there is this regex:
When /^(?:|I )go to (.+)$/ do |page_name|
I know ?: is a non-capturing group but what does it mean when it is there as an alternative separated by | ?
This isn't a special group, it just means "match nothing or I": http://www.rubular.com/r/H3iJFLXaab
This should be the same as writing (?:I )?
(or to be more precise, (?:I )?? - because the empty string has precedence over I, see also Is the lazy version of the 'optional' quantifier ('??') ever useful in a regular expression? )
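The precedence difference only shows when nothing after the group forces it to match. A small sketch, using an illustrative step text:

```ruby
text = "I go to the home page"

# On its own, the group prefers whichever option is tried first.
empty_first = text[/(?:|I )/]   # empty alternative comes first
greedy      = text[/(?:I )?/]   # greedy optional: tries "I " first
lazy        = text[/(?:I )??/]  # lazy optional: tries empty first

p empty_first # ""
p greedy      # "I "
p lazy        # ""

# Inside the full, anchored step pattern all three capture the same page name.
captures = [/^(?:|I )go to (.+)$/, /^(?:I )?go to (.+)$/, /^(?:I )??go to (.+)$/]
             .map { |re| re.match(text)[1] }
p captures    # ["the home page", "the home page", "the home page"]
```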

Regex to match all alphanumeric hashtags, no symbols

I am writing a hashtag scraper for Facebook, and every regex I come across to get hashtags seems to include punctuation as well as alphanumeric characters. Here's an example of what I would like:
Hello #world! I am #m4king a #fac_book scraper and would like a nice regular #expression.
I would like it to match world, m4king, fac and expression (note that I would like it to cut off if it reaches punctuation, including spaces). It would be nice if it didn't include the hash symbol, but it's not super important.
Just in case it's important, I will be using Ruby's String#scan method to grab possibly more than one tag.
Thanks heaps in advance!
A regex such as this: #([A-Za-z0-9]+) should match what you need and place it in a capture group. You can then access this group later. Maybe this will help shed some light on regular expressions (from a Ruby context).
The regex above will start matching when it finds a # tag and will throw any following letters or numbers into a capture group. Once it finds anything which is not a letter or a digit, it will stop the matching. In the end you will end up with a group containing what you are after.
str = 'Hello #world! I am #m4king a #fac_book scraper and would like a nice regular #expression'
str.scan(/#([A-Za-z0-9]+)/).flatten #=> ["world", "m4king", "fac", "expression"]
The call to #flatten is needed because, when the pattern contains capture groups, scan returns one array of captures per match.
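The shape of scan's return value is easiest to see by printing it with and without the capture group. A brief sketch using the sentence from the question:

```ruby
str = 'Hello #world! I am #m4king a #fac_book scraper and would like a nice regular #expression'

nested = str.scan(/#([A-Za-z0-9]+)/)  # one inner array per match
flat   = nested.flatten
whole  = str.scan(/#[A-Za-z0-9]+/)    # no group: whole matches, hash included

p nested # [["world"], ["m4king"], ["fac"], ["expression"]]
p flat   # ["world", "m4king", "fac", "expression"]
p whole  # ["#world", "#m4king", "#fac", "#expression"]
```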
Alternatively, you can use look-behind matching which will match alphanumeric characters only after a '#':
str.scan(/(?<=#)[[:alnum:]]+/) #=> ["world", "m4king", "fac", "expression"]
Here's a simpler regex: /#[[:alnum:]_]+/. Note that it includes underscores, because Facebook currently treats underscores as part of hashtags (as does Twitter), and that the matches include the leading #.
str = 'Hello #world! I am #m4king a #fac_book scraper and would like a nice regular #expression'
str.scan(/#[[:alnum:]_]+/)
Here's a view on Rubular:
http://rubular.com/r/XPPqwtVGN9
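For comparison, here is what the underscore-aware pattern returns, with and without the leading hash. The accented example at the end is made up to show that [[:alnum:]] is not limited to ASCII letters on UTF-8 strings.

```ruby
str = 'Hello #world! I am #m4king a #fac_book scraper and would like a nice regular #expression.'

with_hash    = str.scan(/#[[:alnum:]_]+/)
without_hash = str.scan(/#([[:alnum:]_]+)/).flatten

p with_hash    # ["#world", "#m4king", "#fac_book", "#expression"]
p without_hash # ["world", "m4king", "fac_book", "expression"]

# Illustrative non-ASCII tag: the POSIX class keeps the accented letter.
unicode_tag = "#café time".scan(/#([[:alnum:]_]+)/).flatten
ascii_tag   = "#café time".scan(/#([A-Za-z0-9]+)/).flatten

p unicode_tag  # ["café"]
p ascii_tag    # ["caf"]
```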
