Ruby Regular expression: Not able to find word from string - ruby

I am trying to find the word "the" that has a space before the character "t" and after the character "e" in the string "the the the the". I am using the regular expression below, but it gives me only one word "the" instead of two.
s="the the the the"
s.scan(/\sthe\s/)
output - [" the "]
I was expecting the expression to return the two middle words "the". Why is this happening?

The problem here is that \s patterns consume the whitespace. The scan method only matches non-overlapping matches, and your expected matches are overlapping.
You need to use lookarounds to get overlapping matches:
/(?<=\s)the(?=\s)/
With this pattern, puts s.scan(/(?<=\s)the(?=\s)/) prints two instances of the.
Pattern details:
(?<=\s) - a positive lookbehind that requires a whitespace to be present immediately before the the
the - a literal text the
(?=\s) - a positive lookahead that requires a whitespace right after the the.
Note that if you use \bthe\b (i.e. use word boundaries), you will get all the instances from your string as \b just asserts the position before or after a word char (letter, digit or underscore).
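To compare the three patterns side by side, here is a minimal sketch that runs them on the string from the question. The constant name SENTENCE is only illustrative and does not come from the original post.

```ruby
# The string from the question; SENTENCE is an illustrative name.
SENTENCE = "the the the the"

# \s consumes the spaces, so the two middle words compete for the same space.
consuming = SENTENCE.scan(/\sthe\s/)

# Lookarounds only assert that a space is present; they do not consume it.
lookaround = SENTENCE.scan(/(?<=\s)the(?=\s)/)

# Word boundaries also match at the string edges, so all four words are found.
boundary = SENTENCE.scan(/\bthe\b/)

p consuming   # [" the "]
p lookaround  # ["the", "the"]
p boundary    # ["the", "the", "the", "the"]
```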

Related

regexp match group with the exception of a member of the group

So, there are a number of regular expressions that match a particular group of characters, like the following:
/./ - Any character except a newline.
/./m - Any character (the m modifier enables multiline mode)
/\w/ - A word character ([a-zA-Z0-9_])
/\s/ - Any whitespace character
And in ruby:
/[[:punct:]]/ - Punctuation character
/[[:space:]]/ - Whitespace character ([:blank:], newline, carriage return, etc.)
/[[:upper:]]/ - Uppercase alphabetical
So, here is my question: how do I get a regexp to match a group like this, but leave one member of the group out?
Examples:
match all punctuation characters apart from the question mark
match all whitespace characters apart from the newline
match all words apart from "go"... etc
Thanks.
You can use character class subtraction.
Rexegg:
The syntax […&&[…]] allows you to use a logical AND on several character classes to ensure that a character is present in them all. Intersecting with a negated character, as in […&&[^…]] allows you to subtract that class from the original class.
Consider this code:
s = "./?!"
res = s.scan(/[[:punct:]&&[^!]]/)
puts res
Output is only ., / and ? since ! is excluded.
Restricting with a lookahead (as sawa has written just now) is also possible, but is not required when you have this subtraction supported. When you need to restrict some longer values (more than 1 character) a lookahead is required.
In many cases, a lookahead must be anchored to a word boundary to return correct results. As an example of using a lookahead to restrict punctuation (single character matching generic pattern):
/(?:(?!!)[[:punct:]])+/
This will match one or more punctuation symbols other than !.
The code puts "./?!".scan(/(?:(?!!)[[:punct:]])+/) will output ./?
Use character class subtraction whenever you need to restrict with single characters; it is more efficient than using lookaheads.
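As a rough sketch of the first two scenarios from the question, the snippet below applies subtraction and a lookahead to the same input and then uses subtraction for whitespace without the newline. The constant names and sample strings are made up for illustration.

```ruby
# Sample inputs; both names are illustrative only.
PUNCT_SAMPLE = ".?!,:;-"
SPACE_SAMPLE = "a b\tc\nd"

# Subtraction: any punctuation character that is not a question mark.
by_subtraction = PUNCT_SAMPLE.scan(/[[:punct:]&&[^?]]/)

# Lookahead: the same restriction, asserted before each single character.
by_lookahead = PUNCT_SAMPLE.scan(/(?!\?)[[:punct:]]/)

# Second scenario: any whitespace character except the newline.
blanks = SPACE_SAMPLE.scan(/[[:space:]&&[^\n]]/)

p by_subtraction  # [".", "!", ",", ":", ";", "-"]
p by_lookahead    # [".", "!", ",", ":", ";", "-"]
p blanks          # [" ", "\t"]
```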
So, the 3rd scenario regex must look like:
/\b(?!go\b)\w+\b/
        ^^
If you write /(?!\bgo\b)\b\w+\b/, the regex engine will check each position in the input string. If you use a \b at the beginning, only word boundary positions will be checked, and the pattern will yield better performance. Also note that the \b marked with ^^ (the one right after go inside the lookahead) is very important, since it makes the regex engine check for the whole word go. If you remove it, the pattern will reject every word that starts with go, not just the word go itself.
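The effect of that inner \b can be checked with a short sketch. The sample sentence and the constant name WORDS are assumptions chosen for illustration.

```ruby
# Illustrative input containing "go" and words that merely start with it.
WORDS = "go going gone ago"

# With \b inside the lookahead, only the whole word "go" is excluded.
whole_word_excluded = WORDS.scan(/\b(?!go\b)\w+\b/)

# Without it, every word that starts with "go" is excluded.
prefix_excluded = WORDS.scan(/\b(?!go)\w+\b/)

p whole_word_excluded  # ["going", "gone", "ago"]
p prefix_excluded      # ["ago"]
```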
Put what you want to exclude inside a negative lookahead in front of the match. For example,
To match all punctuation characters apart from the question mark,
/(?!\?)[[:punct:]]/
To match all words apart from "go",
/(?!\bgo\b)\b\w+\b/
This is a general approach that is sometimes useful:
a = []
".?!,:;-".scan(/[[:punct:]]/) { |s| a << s unless s == '?' }
a #=> [".", "!", ",", ":", ";", "-"]
The content of the block is limited only by your imagination.

Regex matching plus or minus

Could someone please look at the following function and explain the regex to me? I don't understand it, and I don't like using something I don't understand, because then I can't replicate it in the future and I don't learn from it.
Also, can someone explain the double !! in front? I know a single one means not, so does a double mean not "not"?
The function is a patch to String to check if it's capable of being converted to an integer or not.
class String
  def is_i?
    !!(self =~ /\A[-+]?[0-9]+\z/)
  end
end
The main thing that's giving me trouble is [-+] as it makes little sense to me, if you could explain in the context given it would be very helpful.
EDIT:
Since people missed the second part of the question I'll be a little more explicit.
What does !! mean in front of the check? I know a single ! means NOT, but I can't find what !! means.
The [-+] Character Class
[-+] is a character class. It means "match one character specified by the class", i.e. - or +.
Hyphens in Character Classes
I can see how this particular class can be confusing, because the hyphen often plays a special role in a character class: it links two characters to form a character range. For instance, [a-z] means "match one character between a and z", and [a-z0-9] means "match one character between a and z or between 0 and 9".
However, in this case, the hyphen in [-+] is positioned in a place where it cannot be used to specify a range, and the - is just a literal hyphen.
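A quick way to see the difference is to scan one string with the hyphen in each position. This is only a sketch; the input string and the name HYPHEN_SAMPLE are invented for the example.

```ruby
# Illustrative input; HYPHEN_SAMPLE is a made-up name.
HYPHEN_SAMPLE = "b-+z"

# Hyphen first or last in the class: it is a literal hyphen.
literal_first = HYPHEN_SAMPLE.scan(/[-+]/)
literal_last  = HYPHEN_SAMPLE.scan(/[+-]/)

# Hyphen between two characters: it forms the range a..c.
as_range = HYPHEN_SAMPLE.scan(/[a-c]/)

p literal_first  # ["-", "+"]
p literal_last   # ["-", "+"]
p as_range       # ["b"]
```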
Decoding the entire expression
\A      # Assert position at the beginning of the string
[-+]?   # Match a single character from the list “-+”, between zero and one times, as many times as possible, giving back as needed (greedy)
[0-9]+  # Match a single character in the range between “0” and “9”, between one and unlimited times, as many times as possible, giving back as needed (greedy)
\z      # Assert position at the very end of the string
A Character Class defines a set of characters, any one of which can occur in a string for a match to succeed.
For example, the regular expression [-+]?[0-9]+ will match 123, -123, or +123 because it defines a character class (accepting either -, +, or neither one) as its first character.
In context:
\A asserts position at the start of the string.
[-+] any character of: - or + (? optional, meaning between zero and one time)
[0-9] any character of: 0 to 9 (+ quantifier meaning 1 or more times)
\z asserts position at the very end of the string.
What does !! mean?
!! placed together converts the value to a boolean. The first ! negates the value and returns true or false; the second ! negates that result again.
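Here is a small sketch of that conversion. It relies on =~ returning the match index (an Integer, which is truthy in Ruby, even 0) or nil. The helper integer_string? is a hypothetical stand-in for the String#is_i? patch from the question, written as a plain method so that String is not modified.

```ruby
# Hypothetical helper mirroring String#is_i? from the question.
def integer_string?(text)
  !!(text =~ /\A[-+]?[0-9]+\z/)
end

# =~ returns the index of the match, or nil when nothing matches.
p("42" =~ /\A[-+]?[0-9]+\z/)   # 0
p("4x2" =~ /\A[-+]?[0-9]+\z/)  # nil

# One ! negates, the second ! flips the result back to a plain boolean.
p !0     # false
p !!0    # true
p !!nil  # false

p integer_string?("-42")  # true
p integer_string?("4x2")  # false
```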
explain the regex for me as I don't understand it
Pattern explanation: \A[-+]?[0-9]+\z
\A Start of string
[-+]? plus or minus sign [zero or one time (optional)]
[0-9]+ 0 to 9 any digit [one or more times]
\z End of string
The above regex pattern matches any integer, positive or negative, with an optional + or - sign.
Read more about Character Classes and test your regex pattern online at Rubular

Why won't my regex lookback work on a URL using Ruby 1.9?

I would like to have this regex:
.match(/wtflungcancer.com\/\S*(?<!js)/i)
NOT match the following string based on the fact that 'js' is present. However, the following matches the entire URL:
"http://www.wtflungcancer.com/wp-content/plugins/contact-form-7/includes/js/jquery.form.min.js?ver=3.32.0-2013.04.03".match(/wtflungcancer.com\/\S*(?<!js)/i)
This happens because \S* consumes all the characters up to the end of the URL, so the lookbehind is only checked there, and at that point the preceding characters are not js.
Something like this should work:
/wtflungcancer.com(?!\S*\.js)/i
Basically
do not let the * consume all characters
instead of using a lookbehind, use a lookahead
search for strings containing wtflungcancer.com NOT followed by a string containing ".js"
-- EDIT: more explanation added --
What is the difference between
"wtflungcancer.com\S*(?<!\.js)"
and
"wtflungcancer.com(?!\S*\.js)"
They look really similar!
Lookarounds (lookahead and lookbehind) in regular expressions tell the regexp engine when a match is correct or not: they do not consume characters of the string.
Lookbehinds in particular tell the regexp engine to look backwards. In your case the lookbehind wasn't anchored on the right side, so the "\S*" just consumed all the non-whitespace characters in the string.
For example, this regexp can work for finding URLs that do NOT end with ".js":
wtflungcancer.com\S+(?<!\.js)$
See? The right side of the lookbehind is anchored using the end of string metacharacter.
In our case, though, we couldn't hook anything to the right side, so I switched from lookbehind to lookahead.
So, the real regular expression just matches "wtflungcancer.com": at that point, the lookahead tells the regexp engine: "In order for this match to be correct, this string must not be followed by a sequence of non-whitespace characters followed by '.js'". This works because lookaheads do not consume actual characters, they just move on character by character to see if the match is good or not.
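The behaviour of the three patterns can be compared with a short sketch. JS_URL is the URL from the question; PAGE_URL and BARE_JS are hypothetical URLs added only for contrast, and the constant names are illustrative.

```ruby
# URL from the question, plus two made-up URLs for comparison.
JS_URL   = "http://www.wtflungcancer.com/wp-content/plugins/contact-form-7/includes/js/jquery.form.min.js?ver=3.32.0-2013.04.03"
PAGE_URL = "http://www.wtflungcancer.com/about"   # hypothetical
BARE_JS  = "http://www.wtflungcancer.com/app.js"  # hypothetical

LOOKBEHIND = /wtflungcancer.com\/\S*(?<!js)/i   # original attempt
LOOKAHEAD  = /wtflungcancer.com(?!\S*\.js)/i    # suggested fix
ANCHORED   = /wtflungcancer.com\S+(?<!\.js)$/   # lookbehind anchored at the end

# The unanchored lookbehind still matches, running to the end of the URL.
p JS_URL.match(LOOKBEHIND)[0]

# The lookahead rejects any URL that contains ".js" after the domain.
p JS_URL.match(LOOKAHEAD)        # nil
p PAGE_URL.match(LOOKAHEAD)[0]   # "wtflungcancer.com"

# The anchored lookbehind only rejects URLs that end with ".js".
p BARE_JS.match(ANCHORED)        # nil
p PAGE_URL.match(ANCHORED)[0]    # "wtflungcancer.com/about"
```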
You can try with this pattern:
wtflungcancer.com\/(?>[^\s.]++|\.++(?!js))*(?!\.)
Explanations:
The goal is to allow all characters that are not a space or a dot followed by js:
(?> # open an atomic group
[^\s.]++ # all characters but white characters and .
| # OR
\.++(?!js) # . not followed by js
)* # close the atomic group, repeat zero or more times
To make sure the pattern checks the whole URL string, I added a lookahead that checks that a dot does not follow.

Why does having a literal space between regex tokens lead to different matchdata objects?

For example, consider the following expressions:
no_space = "This is a test".match(/(\w+)(\w+)/)
with_space = "This is a test".match(/(\w+) (\w+)/)
The expression no_space is now the matchdata object #<MatchData "This" 1:"Thi" 2:"s">, while with_space is #<MatchData "This is" 1:"This" 2:"is">. What is going on here? It seems to me like the literal space between tokens indicates to ruby that it should match multiple words if possible, while not having a space causes the match to be limited to one word. Any explanation or clarification on the subject would be appreciated.
Thanks.
\w doesn't match a space, and + is greedy unless you follow it with ?, so Ruby tries to match as many \w as possible, as long as the rest of the expression also matches. The first \w+ initially takes all of This, then gives back one character so that the second \w+ has something to match, which leaves Thi in the first capture and s in the second.
When you add a space, Ruby matches as many \w as it can up to a space character, and then as many \w again, therefore matching This and is.
Please let me know if this isn't clear.
With the regular expression /(\w+)(\w+)/, the only characters that can be matched are word characters (letters, digits, and underscores). A regular expression will only ever match consecutive characters in a string, so unless you include something in the regular expression to match the spaces between words the regex can't match more than a single word.
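The captures can be inspected directly. The sketch below repeats the two expressions from the question and adds a lazy variant (+?) as an extra illustration of the greedy behaviour; the constant name TEXT is made up.

```ruby
# The string from the question; TEXT is an illustrative name.
TEXT = "This is a test"

no_space   = TEXT.match(/(\w+)(\w+)/)   # greedy, no space between groups
with_space = TEXT.match(/(\w+) (\w+)/)  # literal space between groups
lazy       = TEXT.match(/(\w+?)(\w+)/)  # first group takes as little as possible

p no_space.captures    # ["Thi", "s"]
p with_space.captures  # ["This", "is"]
p lazy.captures        # ["T", "his"]
```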

regular expression back referencing

Why does this snippet:
'He said "Hello"' =~ /(\w)\1/
match "ll"? I thought that the \w part matches "H", and hence \1 refers to "H", so nothing should be matched. Why this result?
I thought that the \w part matches "H"
\w matches any alphanumeric character (and underscore). It also happens to match H, but that's not terribly interesting, since the regular expression then goes on to say that this character has to be matched twice, which H can't be in your text (since it doesn't appear twice consecutively), and neither can any of the other characters except l. So the regular expression matches ll.
You're thinking of /^(\w)\1/. The caret symbol specifies that the match must start at the beginning of the line. Without that, the match can start anywhere in the string (it will find the first match).
And you're right, nothing was matched at that position. Then the regex engine went further and found a match, which it returned to you.
\w of course matches any word character, not just 'H'.
The point is, "\1" means one repetition of whatever "(\w)" captured; only the letter "l" is doubled in your string, so only it matches your regex.
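To make this concrete, here is a short sketch using the string from the question. The constant name QUOTE and the extra word "aardvark" are illustrative additions.

```ruby
# The string from the question; QUOTE is an illustrative name.
QUOTE = 'He said "Hello"'

# Unanchored: the engine tries each position until a doubled character is found.
p(QUOTE =~ /(\w)\1/)   # 11, the index of the first "l"
p QUOTE[/(\w)\1/]      # "ll"

# Anchored with ^: only the start of the line is tried, and "He" is not doubled.
p(QUOTE =~ /^(\w)\1/)  # nil

# An anchored match succeeds when the first character really is doubled.
p("aardvark" =~ /^(\w)\1/)  # 0
```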
A nice page for toying around with ruby and regular expressions is Rubular
