What is the difference between these three alternative ways to write Ruby regular expressions? - ruby

I want to match the path "/". I've tried the following alternatives, and the first two do match, but I don't know why the third doesn't:
/\A\/\z/.match("/") # <MatchData "/">
"/\A\/\z/".match("/") # <MatchData "/">
Regexp.new("/\A\/\z/").match("/") # nil
What's going on here? Why are they different?

The first snippet is the only correct one.
The second example is... misleading. The string literal "/\A\/\z/" is, obviously, not a regex. It's a string. Strings have a #match method which converts its argument to a regexp (if it isn't one already) and matches the string against it. So, in this example, it's '/' that is the regular expression, and it matches a forward slash found in the other string.
The third line is completely broken: you don't need the surrounding slashes there. They are part of the regex literal syntax, which you didn't use. Also, use single-quoted strings, not double-quoted ones (which try to interpret escape sequences like \A):
Regexp.new('\A/\z').match("/") # => #<MatchData "/">
And, of course, none of the above is needed if you just want to check if a string consists of only one forward slash. Just use the equality check in this case.
s == '/'
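To see why the three snippets from the question behave differently, it can help to look at what each literal actually contains. A minimal sketch (the constant names are only for illustration):

```ruby
# In a double-quoted string, \A, \/ and \z are not recognised escapes,
# so the backslashes are simply dropped.
DOUBLE_QUOTED = "/\A\/\z/"   # => "/A/z/"

# In a single-quoted string the backslashes survive.
SINGLE_QUOTED = '\A/\z'

# String#match turns its argument into a regexp; the receiver is the subject.
# Here the pattern is "/" and it is found at the start of "/A/z/".
p DOUBLE_QUOTED.match("/")

# Regexp.new compiles the five literal characters /A/z/,
# and the subject "/" does not contain them.
p Regexp.new(DOUBLE_QUOTED).match("/")         # => nil
p Regexp.new(DOUBLE_QUOTED).match("x/A/z/y")

# Dropping the delimiters and using single quotes gives the intended pattern.
p Regexp.new(SINGLE_QUOTED).match("/")
```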

Related

Ensure non-matching of a pattern within a scope

I am trying to create a regex that matches a pattern in some part of a string, but not in another part of the string.
I am trying to match a substring that
(i) is surrounded by a balanced pair of one or more consecutive backticks `
(ii) and does not include as many consecutive backticks as in the surrounding patterns
(iii) where the surrounding patterns (sequence of backticks) are not adjacent to other backticks.
This is some variant of the syntax of inline code notation in Markdown syntax.
Examples of matches are as follows:
"xxx`foo`yyy" # => matches "foo"
"xxx``foo`bar`baz``yyy" # => matches "foo`bar`baz"
"xxx```foo``bar``baz```yyy" # => matches "foo``bar``baz"
One regex to achieve this is:
/(?<!`)(?<backticks>`+)(?<inline>.+?)\k<backticks>(?!`)/
which uses a non-greedy match.
I was wondering if I can get rid of the non-greedy match.
The idea comes from when the prohibited pattern is a single character. When I want to match a substring that is surrounded by a single quote ' that does not include a single quote in it, I can do either:
/'.+?'/
/'[^']+'/
The first one uses non-greedy match, and the second one uses an explicit non-matching pattern [^'].
I am wondering if it is possible to have something like the second form when the prohibited pattern is not a single character.
Going back to the original issue, there is the negative lookahead syntax (?!...), but I cannot restrict its effective scope. If I make my regex like this:
/(?<!`)(?<backticks>`+)(?<inline>(?!.*\k<backticks>).*)\k<backticks>(?!`)/
then the effect of (?!.*\k<backticks>) will not be limited to within (?<inline>...), but will extend to the rest of the string. And since that contradicts the \k<backticks> at the end, the regex fails to match.
Is there a regex technique to ensure non-matching of a pattern (not-necessarily a single character) within a certain scope?
You can match one or more characters, checking before each one that the delimiter does not start at that position:
/(?<!`)(?<backticks>`+)(?<inline>(?:(?!\k<backticks>).)+)\k<backticks>(?!`)/
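A quick check of this pattern against the three examples from the question. The constant and the helper method are names I made up for the sketch, and the triple-backtick delimiter is built with string multiplication only to keep the snippet easy to paste.

```ruby
# (?:(?!\k<backticks>).)+ consumes one character at a time, but only at
# positions where the opening delimiter does not start again.
INLINE_CODE = /(?<!`)(?<backticks>`+)(?<inline>(?:(?!\k<backticks>).)+)\k<backticks>(?!`)/

# Hypothetical helper: returns the inline part, or nil when nothing matches.
def inline_code(text)
  match = INLINE_CODE.match(text)
  match && match[:inline]
end

triple = "`" * 3

p inline_code("xxx`foo`yyy")                            # => "foo"
p inline_code("xxx``foo`bar`baz``yyy")                  # => "foo`bar`baz"
p inline_code("xxx#{triple}foo``bar``baz#{triple}yyy")  # => "foo``bar``baz"
```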

CamelCase regexp not accounting for spaces

I created a regexp to match the following scenarios: SomethingCool, HelloWorld, MyNameIsDonato, etc. However, it does not account for spaces:
> 'Something Cooler' =~ /([A-Z][a-z0-9]+)+/
=> 0
That passes and it should not pass. A space is not an alphanumeric character. So why does this pass and how can I fix it?
You need to anchor the regex to the beginning and end of the string, or it will just match one of the words:
^([A-Z][a-z0-9]+)+$
^ and $ anchor the beginnings and ends of lines, respectively. To anchor to the beginning and end of the string, use \A and \z (\Z also works, but it additionally allows a trailing newline before the end of the string).
It's worth noting that this is useless if you're trying to find camelcase names within a larger string. For that, use your original regex.
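Here is a small sketch of the difference between the unanchored, line-anchored and string-anchored versions. The constant names and the multi-line sample are assumptions for illustration.

```ruby
UNANCHORED     = /([A-Z][a-z0-9]+)+/
LINE_ANCHORS   = /^([A-Z][a-z0-9]+)+$/
STRING_ANCHORS = /\A([A-Z][a-z0-9]+)+\z/

p('Something Cooler' =~ UNANCHORED)       # => 0 (it matched just "Something")
p('Something Cooler' =~ LINE_ANCHORS)     # => nil
p('SomethingCool'    =~ STRING_ANCHORS)   # => 0

# ^ and $ work per line, so one CamelCase line inside a longer string is enough.
multi = "not camel\nHelloWorld"
p(multi =~ LINE_ANCHORS)     # => 10
p(multi =~ STRING_ANCHORS)   # => nil
```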

What does (?m:\s*) mean in Regex jargon?

What would this mean in an expression?
(?m:.*?)
or this
(?m:\s*)
I mean, it appears to be something to do with whitespace but I'm unsure.
ADDITIONAL DETAILS:
The full expression I'm looking at is:
\A((?m:\s*)((\/\*(?m:.*?)\*\/)|(\#\#\# (?m:.*?)\#\#\#)|(\/\/ .* \n?)+|(\# .* \n?)+))+
(?...) is a way of applying modifiers to the regular expression inside the parentheses.
(?:...) allows you to treat the part between the parentheses as a group, without affecting the set of strings captured by the matching engine. But you can add option letters between the ? and the :, in which case the part of the regular expression between the parentheses behaves as if you had included those option letters when creating the regular expression. That is, /(?m:...)/ behaves the same as /.../m.
The m, in turn, enables "multiline" mode.
CORRECTED:
Here's where I got confused in the original answer, because this option has different meanings in different environments.
This question is tagged Ruby, in which "multiline mode" causes the dot character (.) to match newlines, whereas normally that's the one character it doesn't match:
irb(main):001:0> "a\nb" =~ /a.b/
=> nil
irb(main):002:0> "a\nb" =~ /a.b/m
=> 0
irb(main):003:0> "a\nb" =~ /(?m:a.b)/
=> 0
So your first regular expression, (?m:.*?) will match any number (including zero) of any characters (including newlines). Basically, it will match anything at all, including nothing.
In the second regular expression, (?m:\s*), the modifier has no effect at all because there are no dots in the contained expression to modify.
Back to the first expression. As Ωmega says, the ? after the * means that it is a non-greedy match. If that were the whole expression, or if there were no captures, it wouldn't matter. But when something follows that section and there are captures, you get different results. Without the ?, the longest possible match wins:
irb(main):001:0> /<(.*)>/.match("<a><b>")[1]
=> "a><b"
With the ?, you get the shortest one instead:
irb(main):002:0> /<(.*?)>/.match("<a><b>")[1]
=> "a"
Finally, about the above-mentioned /m confusion (though if you want to avoid becoming confused yourself, this might be a good place to stop reading):
In Perl 5 (which is the source of most regular expression extensions beyond the basic syntax), the behavior triggered by /m in Ruby is instead triggered by the /s option (Ruby accepts a trailing s on a regex literal, but there it selects an encoding and has no effect on the dot). In Perl, /m, despite still being called "multiline mode", has a completely different effect: it causes the ^ and $ anchors to match at newlines within the string as well as at the beginning and end of the whole string respectively. But in Ruby, that behavior is the default, and there's not even an option to change it.
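The Ruby side of that comparison can be checked directly. A small sketch (the sample string is made up):

```ruby
TEXT = "first\nsecond"

# In Ruby, ^ and $ always match at line boundaries, with or without /m.
p(TEXT =~ /^second$/)      # => 6
p(TEXT =~ /^second$/m)     # => 6

# \A and \z anchor to the whole string instead.
p(TEXT =~ /\Asecond\z/)    # => nil

# /m only changes what the dot matches.
p(TEXT =~ /first.second/)    # => nil
p(TEXT =~ /first.second/m)   # => 0
```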
Pattern .*? will match any string, but as short a string as possible, because of the lazy operator ?.
Pattern \s* will match white-space characters (zero or more).
(?m) enables "multi-line mode". In Perl-style engines, this mode makes the caret and dollar match before and after newlines in the subject string; in Ruby, it instead makes the dot match newlines. To apply this mode to some sub-pattern only, the syntax (?m:...) is used, where ... is a matching pattern.
For more information read http://www.regular-expressions.info/modifiers.html

How to match anything EXCEPT this string?

How can I match a string that is NOT partners?
Here is what I have that matches partners:
/^partners$/i
I've tried the following to NOT match partners, but it doesn't seem to work:
/^(?!partners)$/i
Your regex
/^(?!partners)$/i
only matches empty lines because you didn't include the end-of-line anchor in your lookahead assertion. Lookaheads do just that - they "look ahead" without actually matching any characters, so only lines that match the regex ^$ will succeed.
This would work:
/^(?!partners$)/i
This reports a match with any string (or, since we're in Ruby here, any line in a multi-line string) that's different from partners. Note that it only matches the empty string at the start of the line. Which is enough for validation purposes, but the match result will be "" (instead of nil which you'd get if the match failed entirely).
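A short sketch of how this behaves in Ruby. The constant names are mine, and the \A...\z variant at the end is an adjustment that is not part of the answer above; it is there for the case where the whole string, not a line, should be compared.

```ruby
NOT_PARTNERS = /^(?!partners$)/i

p(NOT_PARTNERS =~ "partners")        # => nil
p(NOT_PARTNERS =~ "Partners")        # => nil
p(NOT_PARTNERS =~ "partnership")     # => 0
p(NOT_PARTNERS.match("other")[0])    # => ""

# Because ^ and $ are line anchors in Ruby, any other line is enough to match.
p(NOT_PARTNERS =~ "partners\nother")   # => 9

# Whole-string variant using \A and \z.
WHOLE_STRING = /\A(?!partners\z)/i
p(WHOLE_STRING =~ "partners")          # => nil
p(WHOLE_STRING =~ "partners\nother")   # => 0
```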
Not easily, but it can be done with the lookahead operator.
Here is the Ruby regex:
^((?!partners).)*$
Cheers
If you only want to get a true value when the string is not partners, then there is no need to use a regex and you can just use a string comparison (one which ignores case).
If you for some reason need a positive regex match for any string which does not contain partners (if it's a part of a larger regex for example) you could use several different constructs, like:
^(?:(?!partners).)*$
or
^(?:[^p]+|p(?!artners))*$
For the plain case-insensitive comparison, for example in Java:
!"partners".equalsIgnoreCase(aString)
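Since the question is about Ruby, here is a rough Ruby counterpart of the Java line, next to the first "does not contain" construct. The method and constant names are hypothetical, and casecmp? assumes Ruby 2.4 or later.

```ruby
# Plain comparison, ignoring case.
def partners?(string)
  string.casecmp?("partners")
end

# Matches only lines that do not contain "partners" anywhere.
NO_PARTNERS_INSIDE = /^(?:(?!partners).)*$/

p partners?("Partners")                     # => true
p partners?("partnership")                  # => false
p(NO_PARTNERS_INSIDE =~ "partnership")      # => nil (it contains "partners")
p(NO_PARTNERS_INSIDE =~ "something else")   # => 0
```

Note the difference: the comparison only rejects the exact word, while the regex rejects any line that has partners somewhere in it.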

Replacing partial regex matches in place with Ruby

I want to transform the following text
This is a ![foto](foto.jpeg), here is another ![foto](foto.png)
into
This is a ![foto](/folder1/foto.jpeg), here is another ![foto](/folder2/foto.png)
In other words I want to find all the image paths that are enclosed between brackets (the text is in Markdown syntax) and replace them with other paths. The string containing the new path is returned by a separate real_path function.
I would like to do this using String#gsub in its block version. Currently my code looks like this:
re = /!\[.*?\]\((.*?)\)/
rel_content = content.gsub(re) do |path|
  real_path(path)
end
The problem with this regex is that it will match ![foto](foto.jpeg) instead of just foto.jpeg. I also tried other regexen like (?>\!\[.*?\]\()(.*?)(?>\)) but to no avail.
My current workaround is to split the path and reassemble it later.
Is there a Ruby regex that matches only the path inside the brackets and not all the contextual required characters?
Post-answers update: The main problem here is that Ruby's regexen have no way to specify zero-width lookbehinds. The most generic solution is to group the part of the regexp before the real matching part and the part after it, i.e. /(pre)(matching-part)(post)/, and reconstruct the full string afterwards.
In this case the solution would be
re = /(!\[.*?\]\()(.*?)(\))/
rel_content = content.gsub(re) do
  $1 + real_path($2) + $3
end
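For anyone who wants to run this, here is a self-contained sketch. The real_path below is only a hypothetical stand-in for the function mentioned in the question (it picks a folder by file extension), and IMAGE and rewrite_images are names invented for the example.

```ruby
# Hypothetical stand-in for the question's real_path function.
def real_path(path)
  folder = path.end_with?(".png") ? "folder2" : "folder1"
  "/#{folder}/#{path}"
end

# Group 1: everything before the path, group 2: the path, group 3: the closing paren.
IMAGE = /(!\[.*?\]\()(.*?)(\))/

def rewrite_images(content)
  # Inside the block, $1..$3 refer to the groups of the current match.
  content.gsub(IMAGE) { $1 + real_path($2) + $3 }
end

puts rewrite_images("This is a ![foto](foto.jpeg), here is another ![foto](foto.png)")
# This is a ![foto](/folder1/foto.jpeg), here is another ![foto](/folder2/foto.png)
```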
A quick solution (adjust as necessary):
s = 'This is a ![foto](foto.jpeg)'
s.sub!(/(!\[.*?\])\((.*?)\)/, '\1(/folder1/\2)')
p s # => "This is a ![foto](/folder1/foto.jpeg)"
You can always do it in two steps - first extract the whole image expression out and then second replace the link:
str = "This is a ![foto](foto.jpeg), here is another ![foto](foto.png)"
str.gsub(/\!\[[^\]]*\]\(([^)]*)\)/) do |image|
  image.gsub(/(?<=\()(.*)(?=\))/) do |link|
    "/a/new/path/" + link
  end
end
#=> "This is a ![foto](/a/new/path/foto.jpeg), here is another ![foto](/a/new/path/foto.png)"
I changed the first regex a bit, but you can use the same one you had before in its place. image is the image expression like ![foto](foto.jpeg), and link is just the path like foto.jpeg.
[EDIT] Clarification: Ruby does have lookbehinds (and they are used in my answer):
You can create lookbehinds with (?<=regex) for positive and (?<!regex) for negative, where regex is an arbitrary regular expression subject to the following condition: because of limitations in the regex implementation, expressions in lookbehinds have to be fixed width, which means that they can't include repetitions of unknown length or, apart from alternatives at the top level of the lookbehind, alternations with different-width choices. If you try to do that, you'll get an error. (The restriction doesn't apply to lookaheads, though.)
In your case, the [foto] part has a variable width (foto can be any string) so it can't go into a lookbehind due to the above. However, lookbehind is exactly what we need since it's a zero-width match, and we take advantage of that in the second regex which only needs to worry about (fixed-length) compulsory open parentheses.
Obviously you can put real_path in from here, but I just wanted a test-able example.
I think that this approach is more flexible and more readable than reconstructing the string through the match group variables.
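The fixed-width restriction is easy to try out. This is only a sketch of what I'd expect on the Ruby versions I know of: the names are invented, and Regexp.new is used so that the bad pattern fails at run time, where it can be rescued, rather than when the file is parsed.

```ruby
# Fixed-width lookbehind: accepted.
FIXED = Regexp.new('(?<=\()[^)]*(?=\))')
p("![foto](foto.jpeg)"[FIXED])   # => "foto.jpeg"

# Variable-width lookbehind: rejected when the pattern is compiled.
# Returns the error message, or nil if the pattern was accepted.
def variable_width_lookbehind_error
  Regexp.new('(?<=!\[.*?\]\()[^)]*')
  nil
rescue RegexpError => e
  e.message
end

p variable_width_lookbehind_error
```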
In your block, use $1 to access the first capture group ($2 for the second and so on).
From the documentation:
In the block form, the current match string is passed in as a parameter, and variables such as $1, $2, $`, $&, and $' will be set appropriately. The value returned by the block will be substituted for the match on each call.
As a side note, some people think '\1' is unsuitable when the text around the part you want to change has no fixed length. For example, if you want to modify something in the middle of a string, how do you preserve the characters on both sides?
It's easy: put capture groups (parentheses) around the parts you want to keep.
For example, say I want to change a-ruby-porgramming-book-531070.png to a-ruby-porgramming-book.png, that is, remove the part from the last "-" up to the last ".".
I can use /.*(-.*?)\./ to match -531070. Now how should I replace it? Note that the rest of the name has no definite format.
The answer is to put parentheses around the rest as well, and then put it back in the replacement:
"a-ruby-porgramming-book-531070.png".sub(/(.*)(-.*?)\./, '\1.')
# => "a-ruby-porgramming-book.png"
If you want to add something before the matched content, you can use:
"a-ruby-porgramming-book-531070.png".sub(/(.*)(-.*?)\./, '\1-2019\2.')
# => "a-ruby-porgramming-book-2019-531070.png"
