regex: question mark followed by colon as an alternative - ruby

In rails cucumber there is this regex
When /^(?:|I )go to (.+)$/ do |page_name|
I know ?: is a non-capturing group but what does it mean when it is there as an alternative separated by | ?

This isn't a special group; it just means "match nothing or 'I '" (note the trailing space): http://www.rubular.com/r/H3iJFLXaab
This should be the same as writing (?:I )?
(or to be more precise, (?:I )?? - because the empty string has precedence over I, see also Is the lazy version of the 'optional' quantifier ('??') ever useful in a regular expression? )
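The difference in precedence can be seen by anchoring each variant at the start of a string and looking at what it consumes. This is only a small sketch; the sample string and variable names are made up for illustration.

```ruby
s = "I go to the home page"

# Alternation tries its branches left to right, so the empty branch wins.
empty_first = s[/\A(?:|I )/]
# A greedy optional group tries to match "I " before giving up.
greedy = s[/\A(?:I )?/]
# A lazy optional group tries the empty match first, like the alternation.
lazy = s[/\A(?:I )??/]

p empty_first  # => ""
p greedy       # => "I "
p lazy         # => ""
```

In the full step pattern the rest of the regex forces the same overall match either way, so the difference only shows up in the order the engine tries things.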

Related

Regex to match certain conditions

Basically I want a regex to match these conditions:
First 8 characters should be within [a-zA-Z]
Followed by any number of digits
Followed by any word character but not immediately followed by "or" or "and"
I currently have this regex:
^(?i:([a-z]{1,8})(\d+)((?!or|and).)+)$
this works fine for the following example:
ABCDEFGH1ZZZ
GFEDCBAH99ZZZ99
but NOT with this one, because I think the "OR" inside "FORALL" trips the lookahead:
WOLRDWAR2FORALL
Expected output:
AAAAAAAA100NANDROID - should match
AAAAAAAA100ANDROID - should not match
AAAAAAAA100OR - should not match
AAAAAAAA100AND - should not match
Basically I don't want the FOR match the OR, any solution for my problem? btw, this is for Ruby
The problem with @anubhava's regex, and others like it, is that
it's too liberal in using .* after the assertion.
That means the engine can end \d+ one digit early, so the assertion is tested
in front of a digit instead of in front of OR/AND, and .* picks up the rest on the other side.
For example ^(?i:([a-z]{8})(\d+)((?!or|and).*))$ easily matches AAAAAAAA100AND
This is a rare case that causes the engine to backtrack a digit, to satisfy the assertion.
Usually, if .* were not used, it would be unnecessary to be concerned.
This can be fixed by injecting a \d* construct in the assertion.
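Be aware that an assertion is evaluated on its own at the current position: its subpattern is tried first, and only then does the assertion pass or fail. This does not prevent the engine from backtracking into earlier parts of the pattern and retrying the assertion at a different position.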
^(?i:([a-z]{8})(\d+)((?!\d*(?:or|and)).*))$
Expanded:
^
(?i:
   ( [a-z]{8} )          # (1)
   ( \d+ )               # (2)
   (                     # (3 start)
      (?!
         \d*
         (?: or | and )
      )
      .*
   )                     # (3 end)
)
$
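To see the backtracking issue concretely, the two patterns can be run side by side over the sample inputs from the question. This is a sketch for checking the claim; the names `naive`, `fixed`, and `samples` are not from the original answers.

```ruby
# Pattern without the \d* guard: \d+ can give back a digit to dodge the lookahead.
naive = /^(?i:([a-z]{8})(\d+)((?!or|and).*))$/
# Pattern with \d* inside the assertion, as proposed above.
fixed = /^(?i:([a-z]{8})(\d+)((?!\d*(?:or|and)).*))$/

samples = %w[
  AAAAAAAA100NANDROID
  AAAAAAAA100ANDROID
  AAAAAAAA100OR
  AAAAAAAA100AND
  WOLRDWAR2FORALL
]

naive_results = samples.map { |s| naive.match?(s) }
fixed_results = samples.map { |s| fixed.match?(s) }

samples.each_with_index do |s, i|
  puts format("%-20s naive=%-5s fixed=%s", s, naive_results[i], fixed_results[i])
end
# naive accepts every sample; fixed accepts only the first and the last.
```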
You can tweak your regex as:
/^(?i:([a-z]{8})(\d+)((?!or|and).*))$/
RegEx Demo
I think you are looking for this (I am using a positive look-behind (?<=\d) so that we only exclude or or and that are preceded by a digit):
^(?i:([a-z]{1,8})(\d+)((?!(?<=\d)(?:or|and)).)+)$
See demo
anubhava's answer seems to match the correct values, but all of the previous answers seem to include one or more capture groups, which I didn't see requested in your original post. Here's another possible solution that will match the entire string without groups:
^(?i:[a-z]{8}\d+(?!or|and).*)$
Rubular Demo

Ruby regex - gsub only captured group

I'm not quite sure I understand how non-capturing groups work. I am looking for a regex to produce this result: 5.214. I thought the regex below would work, but it is replacing everything including the non-capture groups. How can I write a regex to only replace the capture groups?
"5,214".gsub(/(?:\d)(,)(?:\d)/, '.')
# => ".14"
My desired result:
"5,214".gsub(some_regex)
#=> "5.214"
Non-capturing groups still consume the text they match.
Use
"5,214".gsub(/(\d+)(,)(\d+)/, '\1.\3')
or (Ruby's look-behind must be fixed-width, so \d rather than \d+)
"5,214".gsub(/(?<=\d)(,)(?=\d)/, '.')
You can't. gsub replaces the entire match; it does not do anything with the captured groups. It will not make any difference whether the groups are captured or not.
In order to achieve the result, you need to use lookbehind and lookahead.
"5,214".gsub(/(?<=\d),(?=\d)/, '.')
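The three behaviours discussed so far can be compared directly. A short sketch, with variable names chosen only for illustration:

```ruby
s = "5,214"

# The digits are inside the match (grouped or not), so they are replaced too.
whole_match = s.gsub(/(?:\d)(,)(?:\d)/, '.')
# The digits are only asserted by the lookarounds; the match is just the comma.
lookaround = s.gsub(/(?<=\d),(?=\d)/, '.')
# The digits are consumed, then written back through backreferences.
backreference = s.gsub(/(\d),(\d)/, '\1.\2')

p whole_match    # => ".14"
p lookaround     # => "5.214"
p backreference  # => "5.214"
```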
It is also possible to use Regexp.last_match (also available via $~) in the block version to get access to the full MatchData:
"5,214".gsub(/(\d),(\d)/) { |_|
  match = Regexp.last_match
  "#{match[1]}.#{match[2]}"
}
This scales better to more involved use-cases.
Nota bene, from the Ruby docs:
the ::last_match is local to the thread and method scope of the method that did the pattern match.
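As one example of a slightly more involved case, the block form combines well with named groups. This is a sketch only; the group names `int` and `frac` and the sample text are assumptions made for the illustration.

```ruby
text = "5,214 and 7,001"

result = text.gsub(/(?<int>\d),(?<frac>\d)/) do
  m = Regexp.last_match          # MatchData for the current match
  "#{m[:int]}.#{m[:frac]}"       # rebuild the match with a dot instead of a comma
end

p result  # => "5.214 and 7.001"
```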
gsub replaces the entire match the regular expression engine produces. Neither capturing nor non-capturing groups are preserved. However, you can use lookaround assertions (or \K), which do not "consume" any characters of the string.
"5,214".gsub(/\d\K,(?=\d)/, '.')
Explanation: The \K escape sequence resets the starting point of the reported match and any previously consumed characters are no longer included. That being said, we then look for and match the comma, and the Positive Lookahead asserts that a digit follows.
I know nothing about Ruby.
But from what I see in the tutorial,
gsub means replace.
The pattern should be /(?<=\d),(?=\d)/, which just replaces the comma with a dot.
Or use captures: /(\d+),(\d+)/ with the replacement string '\1.\2'.
You can easily reference capture groups in the replacement string (second argument) like so:
"5,214".gsub(/(\d+)(,)(\d+)/, '\1.\3')
#=> "5.214"
\0 will return the whole matched string.
\1 will be replaced by the first capturing group.
\2 will be replaced by the second capturing group etc.
You could rewrite the example above using a non-capturing group for the , char.
"5,214".gsub(/(\d+)(?:,)(\d+)/, '\1.\2')
#=> "5.214"
As you can see, the part after the comma is now the second capturing group, since we defined the middle group as non-capturing.
Although it's kind of pointless in this case. You can just omit the capturing group for , altogether
"5,214".gsub(/(\d+),(\d+)/, '\1.\2')
#=> "5.214"
You don't need regexp to achieve what you need:
'1,200.00'.tr('.','!').tr(',','.').tr('!', ',')
Periods become bangs (1,200!00)
Commas become periods (1.200!00)
Bangs become commas (1.200,00)
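Since `String#tr` maps characters position by position, the swap can also be sketched as a single call without the temporary `!` placeholder. The variable names here are illustrative.

```ruby
amount = '1,200.00'

# Three passes through a placeholder character, as in the answer above.
three_step = amount.tr('.', '!').tr(',', '.').tr('!', ',')
# One pass: '.' maps to ',' and ',' maps to '.' at the same time.
one_step = amount.tr('.,', ',.')

p three_step  # => "1.200,00"
p one_step    # => "1.200,00"
```

The single call also avoids surprises if the input already contains a `!`.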

What does `(?:| ...)` mean in a Ruby regular expression?

While reading Engineering long-lasting software: an Agile approach using SaaS and cloud computing I came across the following regex (Chapter 5, Section 5.3 Introducing Cucumber and Capybara):
/^(?:|I )am on (.+)$/
I know about the non-capturing (?: ...) syntax, but what I don’t understand is the meaning of the first pipe character after the colon. Is it a typo? Does it serve any particular purpose?
The pipe in regex means alternative. In this case, it is expressing alternation between an empty string "" and the string "I ".
It is just the or. It can match either nothing or I (with a space). The rest is non-capturing group like you mention.
The regex matches something like I am on a diet and also am on a diet and in the above examples, captures a diet in the first group.
Try it out on Rubular - http://rubular.com/r/q3RFEoxj1e
(?:|something)
("nothing / empty string or the match")
Matches the same input as (only the order in which the engine tries the two options differs):
(?:something)?
("the match, once or none")
In other words: the non-capturing subpattern is optional.
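A quick way to convince yourself is to run both forms of the step pattern over a few inputs. This is a sketch; the third input is a made-up negative case.

```ruby
original  = /^(?:|I )am on (.+)$/
rewritten = /^(?:I )?am on (.+)$/

inputs = ["I am on a diet", "am on a diet", "You am on a diet"]

# String#[] with a regexp and a group index returns that capture, or nil on no match.
original_captures  = inputs.map { |s| s[original, 1] }
rewritten_captures = inputs.map { |s| s[rewritten, 1] }

p original_captures   # => ["a diet", "a diet", nil]
p rewritten_captures  # => ["a diet", "a diet", nil]
```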

Regex - Matching text AFTER certain characters

I want to scrape data from some text and dump it into an array. Consider the following text as example data:
| Example Data
| Title: This is a sample title
| Content: This is sample content
| Date: 12/21/2012
I am currently using the following regex to scrape the data that is specified after the 'colon' character:
/((?=:).+)/
Unfortunately this regex also grabs the colon and the space after the colon. How do I only grab the data?
Also, I'm not sure if I'm doing this right.. but it appears as though the outside parens causes a match to return an array. Is this the function of the parens?
EDIT: I'm using Rubular to test out my regex expressions
You could change it to:
/: (.+)/
and grab the contents of group 1. A lookbehind works too, though, and does just what you're asking:
/(?<=: ).+/
In addition to @minitech's answer, you can also make a 3rd variation:
/(?<=: ?)(.+)/
The difference here being, you create/grab the group using a look-behind.
If you still prefer the look-ahead rather than look-behind concept. . .
/(?=: ?(.+))/
This will place a grouping around your existing regex where it will catch it within a group.
And yes, the outer parentheses in your code create a capture group. Compare that to the last example I gave, where the group sits inside the look-ahead instead of wrapping the whole pattern in /( ... )/. Wrapping everything is unnecessary, since the first result in most regular expression engines is already the entire matched string.
I know you are asking for regex but I just saw the regex solution and found that it is rather hard to read for those unfamiliar with regex.
I'm also using Ruby and I decided to do it with:
line_as_string.split(": ")[-1]
This does what you require and IMHO it's far more readable.
For a very long string it might be inefficient. But not for this purpose.
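One caveat with taking the last piece: if the value itself contains ": ", only its tail is returned. A sketch of a variant that uses the limit argument of `String#split` to split at the first separator only (the sample line is made up):

```ruby
line = "Note: see section 2: overview"

# Splits on every ": ", so only the final piece survives.
last_piece = line.split(": ")[-1]
# Limit of 2 splits at the first ": " only, keeping the rest of the value intact.
after_first = line.split(": ", 2).last

p last_piece   # => "overview"
p after_first  # => "see section 2: overview"
```

Either way, a line without ": " comes back whole, so header lines may need to be filtered out separately.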
In Ruby, as in PCRE and Boost, you may make use of the \K match reset operator:
\K keeps the text matched so far out of the overall regex match. h\Kd matches only the second d in adhd.
So, you may use
/:[[:blank:]]*\K.+/ # To only match horizontal whitespaces with `[[:blank:]]`
/:\s*\K.+/ # To match any whitespace with `\s`
See Rubular demo #1 and Rubular demo #2.
Details
: - a colon
[[:blank:]]* - 0 or more horizontal whitespace chars
\K - match reset operator discarding the text matched so far from the overall match memory buffer
.+ - matches and consumes any 1 or more chars other than line break chars (use /m modifier to match any chars including line break chars).
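Applied to the sample data from the question, `String#scan` with the `\K` pattern collects the values straight into an array. This sketch also shows why a capture group changes the shape of the result; the heredoc simply reproduces the question's example.

```ruby
text = <<~DATA
  | Example Data
  | Title: This is a sample title
  | Content: This is sample content
  | Date: 12/21/2012
DATA

# No capture groups: scan returns a flat array of the (post-\K) matches.
values = text.scan(/:[[:blank:]]*\K.+/)
# With a capture group: scan returns one inner array per match.
grouped = text.scan(/: (.+)/)

p values   # => ["This is a sample title", "This is sample content", "12/21/2012"]
p grouped  # => [["This is a sample title"], ["This is sample content"], ["12/21/2012"]]
```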

How to use RegEx to replace items based on their context, without affecting the context

Using Ruby, I am writing a regular expression, and I need to be able to remove any colon that appears between parentheses. I understand that I can use
"This is a (string :)".sub!(/\([^\)]*:/, '')
to do this, but the problem is that this function will also remove the context along with it. Is there any way to specify that I only want it to remove the colon and not the entire matching expression?
Some regular expression engines support what are called look-ahead and look-behind matches, which match but do not consume characters. Ruby 1.8 supports look-ahead but not look-behind (which is more difficult to do in a performant way; Ruby 1.9 and later add fixed-width look-behind). That means you could quite easily stick with sub and remove a colon that precedes a closing parenthesis, but without ensuring it comes after an opening parenthesis:
string = 'This is a (string :)'
string.sub /:(?=\))/, ''
# => 'This is a (string )'
The alternative would be to use subpattern capturing (which happens automatically when you use grouping in your regular expression) to rebuild the string without the undesirable portion, in this case the colon:
string.sub /(\([^:]+):\)/, '\1)'
The \1 is a back-reference to what is matched in the first group, which is delimited by the parentheses that are not escaped. You can see here I didn't bother capturing the closing parenthesis in a second group, opting instead simply to include it in the substitution. This works well in this case because it will not change, but if you don't know that the colon will appear at the end of the parentheses-enclosed content, you would need a second group:
string.sub /(\([^:]+):([^)]+\))/, '\1\2'
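Note that the two-group pattern requires at least one character between the colon and the closing parenthesis, so it does not cover the original string. A sketch comparing the patterns; the `relaxed` variant (with `*` instead of `+`) and the second sample string are assumptions added for illustration, not part of the answer above.

```ruby
trailing = 'This is a (string :)'   # colon right before the closing paren
middle   = 'Note (see: here)'       # colon in the middle of the group

one_group  = /(\([^:]+):\)/
two_groups = /(\([^:]+):([^)]+\))/
relaxed    = /(\([^:]+):([^)]*\))/  # hypothetical variant: allows nothing after the colon

p trailing.sub(one_group, '\1)')    # => "This is a (string )"
p middle.sub(two_groups, '\1\2')    # => "Note (see here)"
p trailing.sub(two_groups, '\1\2')  # => "This is a (string :)" (no match, unchanged)
p trailing.sub(relaxed, '\1\2')     # => "This is a (string )"
p middle.sub(relaxed, '\1\2')       # => "Note (see here)"
```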
The prior answer will mostly work for deleting single colons within paren groups, but has trouble with multiples like '(thing:foo:bar)'. It would be nice to use lookbehind and lookahead to make the within-parens assertion, but Ruby (like most regexp engines) doesn't support non-deterministic length patterns in lookbehind.
irb> s = 'x (a:b:c) : (1:2:3) y'
=> "x (a:b:c) : (1:2:3) y"
irb> s.gsub /(?<=\([^\(]*):(?=[^\)]*\))/, ''
SyntaxError: (irb):10: invalid pattern in look-behind: /(?<=\([^\(]*):(?=[^\)]*\))/
from /Users/dbenhur/.rbenv/versions/1.9.2-wp/bin/irb:12:in `<main>'
You could instead use the block form of gsub to capture paren enclosed groups, then remove colons from each match:
irb> s.gsub(/\([^\)]*\)/) {|m| m.delete ':'}
=> "x (abc) : (123) y"
In regex in general, you can match with (\()(:)(\)) and replace with \1\3.
I'm not familiar with Ruby. Basically, you have three groups, (, : and ), and you get rid of the second one, the :.
I tested it in Notepad++ and it works.
I think this is called: regex backreference
Cheers.
If you can assume all parentheses will come in balanced pairs like they do in your example, this should be all you need:
"This is a (string :)".gsub!(/:(?=[^()]*\))/, '')
If the lookahead succeeds in finding a closing paren without seeing an opening paren first, the colon must be inside a (...) sequence. Notice how I excluded the opening paren as well as the closing paren; that's essential.
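The lookahead approach can be checked against the multi-colon string used in the earlier irb session, next to the block form of gsub. A sketch under the same balanced-parentheses assumption; the variable names are illustrative.

```ruby
s = 'x (a:b:c) : (1:2:3) y'

# Remove a colon only if a ")" is reached before any other paren.
by_lookahead = s.gsub(/:(?=[^()]*\))/, '')
# Match each parenthesised group and strip colons inside the block.
by_block = s.gsub(/\([^)]*\)/) { |group| group.delete(':') }

p by_lookahead  # => "x (abc) : (123) y"
p by_block      # => "x (abc) : (123) y"
```

The colon between the two groups is kept in both cases, because the next paren after it is an opening one.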

Resources