So, there are a number of regular expressions that match a particular group of characters, like the following:
/./ - Any character except a newline.
/./m - Any character (the m modifier enables multiline mode)
/\w/ - A word character ([a-zA-Z0-9_])
/\s/ - Any whitespace character
And in Ruby:
/[[:punct:]]/ - Punctuation character
/[[:space:]]/ - Whitespace character ([:blank:], newline, carriage return, etc.)
/[[:upper:]]/ - Uppercase alphabetical
So, here is my question: how do I get a regexp to match a group like this, but exclude one character from it?
Examples:
match all punctuation apart from the question mark
match all whitespace characters apart from the newline
match all words apart from "go"... etc
Thanks.
You can use character class subtraction.
Rexegg:
The syntax […&&[…]] allows you to use a logical AND on several character classes to ensure that a character is present in them all. Intersecting with a negated character, as in […&&[^…]] allows you to subtract that class from the original class.
Consider this code:
s = "./?!"
res = s.scan(/[[:punct:]&&[^!]]/)
puts res
Output is only ., / and ? since ! is excluded.
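The same subtraction should also cover the second scenario from the question (whitespace apart from the newline). A minimal sketch, assuming a made-up sample string; the variable names are illustrative only:

```ruby
# Hypothetical sample input containing a space, a tab and a newline.
text = "a b\tc\nd"

# [[:space:]] intersected with "anything but \n" leaves whitespace minus the newline.
no_newline_ws = text.scan(/[[:space:]&&[^\n]]/)

# An alternative spelling without intersection: "neither non-whitespace nor \n".
alt = text.scan(/[^\S\n]/)

p no_newline_ws  # => [" ", "\t"]
p alt            # => [" ", "\t"]
```

Both forms return the same result for this input; the intersection form reads closer to the "class minus one character" idea.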
Restricting with a lookahead (as sawa has written just now) is also possible, but is not required when you have this subtraction supported. When you need to restrict some longer values (more than 1 character) a lookahead is required.
In many cases, a lookahead must be anchored to a word boundary to return correct results. As an example of using a lookahead to restrict punctuation (single character matching generic pattern):
/(?:(?!!)[[:punct:]])+/
This will match one or more punctuation symbols other than !.
The code puts "./?!".scan(/(?:(?!!)[[:punct:]])+/) will output ./? (see demo)
Use character class subtraction whenever you need to restrict single characters; it is more efficient than using lookaheads.
So, the 3rd scenario regex must look like:
/\b(?!go\b)\w+\b/
        ^^
If you write /(?!\bgo\b)\b\w+\b/, the regex engine will check each position in the input string. If you use a \b at the beginning, only word boundary positions will be checked, and the pattern will yield better performance. Also note that the ^^ \b is very important since it makes the regex engine check for the whole word go. If you remove it, it will only restrict to the words that do not start with go.
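To see what the \b inside the lookahead changes, here is a small sketch. The sample sentence is an assumption made up for illustration; only the patterns come from the answer:

```ruby
# Hypothetical sample words; "go" is the only one that should be excluded.
words = "go gone ago going"

# With \b inside the lookahead, only the exact word "go" is rejected.
whole_word = words.scan(/\b(?!go\b)\w+\b/)

# Without it, every word that merely starts with "go" is rejected as well.
prefix_only = words.scan(/\b(?!go)\w+\b/)

p whole_word   # => ["gone", "ago", "going"]
p prefix_only  # => ["ago"]
```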
Put what you want to exclude inside a negative lookahead in front of the match. For example,
To match all punctuations apart from the question mark,
/(?!\?)[[:punct:]]/
To match all words apart from "go",
/(?!\bgo\b)\b\w+\b/
This is a general approach that is sometimes useful:
a = []
".?!,:;-".scan(/[[:punct:]]/) { |s| a << s unless s == '?' }
a #=> [".", "!", ",", ":", ";", "-"]
The content of the block is limited only by your imagination.
Related
I am trying to find the word "the" with a space before the "t" and after the "e" in the string "the the the the". I am using the regular expression below, but it returns only one "the" instead of two.
s="the the the the"
s.scan(/\sthe\s/)
Output: [" the "]
I was expecting the expression to return the two middle occurrences of "the". Why is this happening?
The problem here is that the \s patterns consume the whitespace. The scan method only returns non-overlapping matches, and your expected matches overlap.
You need to use lookarounds to get overlapping matches:
/(?<=\s)the(?=\s)/
See the regex demo and a Ruby demo where puts s.scan(/(?<=\s)the(?=\s)/) prints 2 the instances.
Pattern details:
(?<=\s) - a positive lookbehind that requires a whitespace to be present immediately before the the
the - a literal text the
(?=\s) - a positive lookahead that requires a whitespace right after the the.
Note that if you use \bthe\b (i.e. use word boundaries), you will get all the instances from your string as \b just asserts the position before or after a word char (letter, digit or underscore).
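The three variants discussed here can be compared side by side. This is just a sketch using the string from the question; the variable names are my own:

```ruby
s = "the the the the"

# \s consumes the surrounding spaces, so neighbouring matches cannot share one.
consuming = s.scan(/\sthe\s/)

# Lookarounds only assert the spaces, so both middle words are found.
lookarounds = s.scan(/(?<=\s)the(?=\s)/)

# Word boundaries also accept the start and end of the string.
boundaries = s.scan(/\bthe\b/)

p consuming    # => [" the "]
p lookarounds  # => ["the", "the"]
p boundaries   # => ["the", "the", "the", "the"]
```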
So I have a string that looks like this:
#jackie#test.com, #mike#test.com
What I want to do is before any email in this comma separated list, I want to remove the #. The issue I keep running into is that if I try to do a regular \A flag like so /[\A#]+/, it finds all the instances of # in that string...including the middle crucial #.
The same thing happens if I do /[\s#]+/. I can't figure out how to just look at the beginning of each string, where each string is a complete email address.
Edit 1
Note that all I need is the regex, I already have the rest of the stuff I need to do what I want. Specifically, I am achieving everything else like this:
str.gsub(/#/, '').split(',').map(&:strip)
Where str is my string.
All I am looking for is the regex portion for my gsub.
You may use the below negative lookbehind based regex.
str.gsub(/(?<!\S)#/, '').split(',').map(&:strip)
(?<!\S) is a negative lookbehind asserting that the # we are about to match is not preceded by a non-space character. So this matches a # at the start of the string or a # that directly follows a whitespace character.
The difference between my answer and hwnd's str.gsub(/\B#/, '') is that mine won't match the # in :#, whereas hwnd's does. \B matches between two word characters or between two non-word characters.
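A quick sketch of that difference, assuming a made-up input ("a:#b") for the :# case; the first string is the one from the question:

```ruby
str = "#jackie#test.com, #mike#test.com"

# Only a # at the start of the string or right after whitespace is removed.
emails = str.gsub(/(?<!\S)#/, '').split(',').map(&:strip)
p emails  # => ["jackie#test.com", "mike#test.com"]

# Hypothetical input where # follows a non-space, non-word character.
tricky = "a:#b"
kept    = tricky.gsub(/(?<!\S)#/, '')  # ':' is a non-space character, so # stays
removed = tricky.gsub(/\B#/, '')       # ':' and '#' are both non-word, so \B matches

p kept     # => "a:#b"
p removed  # => "a:b"
```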
Here is one solution
str = "#jackie#test.com, #mike#test.com"
p str.split(/,[ ]+/).map{ |i| i.gsub(/^#/, '')}
Output
["jackie#test.com", "mike#test.com"]
I would like to have this regex:
.match(/wtflungcancer.com\/\S*(?<!js)/i)
NOT match the following string based on the fact that 'js' is present. However, the following matches the entire URL:
"http://www.wtflungcancer.com/wp-content/plugins/contact-form-7/includes/js/jquery.form.min.js?ver=3.32.0-2013.04.03".match(/wtflungcancer.com\/\S*(?<!js)/i)
This happens because \S* consumes every character up to the end of the URL, so the lookbehind is only checked there, where the preceding characters are not js.
Something like this should work:
/wtflungcancer.com(?!\S*\.js)/i
Basically
do not let the * consume all characters
instead of using a lookbehind, use a lookahead
search for strings containing wtflungcancer.com NOT followed by a string containing ".js"
-- EDIT: more explanation added --
What is the difference between
"wtflungcancer.com\S*(?<!\.js)"
and
"wtflungcancer.com(?!\S*\.js)"
They look really similar!
Lookarounds (lookahead and lookbehind) in regular expressions tell the regexp engine when a match is correct or not: they do not consume characters of the string.
Lookbehinds in particular tell the regexp engine to look backwards from the current position. In your case nothing anchored the right-hand side of the lookbehind, so "\S*" simply consumed all the non-whitespace characters in the string.
For example, this regexp can work for finding URLs NOT ending with ".js":
wtflungcancer.com\S+(?<!\.js)$
See? The right side of the lookbehind is anchored using the end of string metacharacter.
In our case, though, we couldn't anchor anything to the right side, so I switched from a lookbehind to a lookahead.
So, the real regular expression just matches "wtflungcancer.com": at that point, the lookahead tells the regexp engine: "In order for this match to be correct, this string must not be followed by a sequence of non-whitespace characters followed by '.js'". This works because lookaheads do not consume actual characters, they just move on character by character to see if the match is good or not.
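The behaviour of the three patterns can be checked with a short sketch. The first URL is the one from the question; the other two URLs are hypothetical examples I made up for contrast:

```ruby
js_url   = "http://www.wtflungcancer.com/wp-content/plugins/contact-form-7/includes/js/jquery.form.min.js?ver=3.32.0-2013.04.03"
page_url = "http://www.wtflungcancer.com/about"   # hypothetical
bare_js  = "http://www.wtflungcancer.com/app.js"  # hypothetical

unanchored_lookbehind = /wtflungcancer.com\/\S*(?<!js)/i
lookahead             = /wtflungcancer.com(?!\S*\.js)/i
anchored_lookbehind   = /wtflungcancer.com\S+(?<!\.js)$/

# The unanchored lookbehind is only tested after \S* has run to the end of the URL.
p js_url.match?(unanchored_lookbehind)  # => true

# The lookahead scans ahead for ".js" before anything is consumed.
p js_url.match?(lookahead)              # => false
p page_url.match?(lookahead)            # => true

# Anchoring the lookbehind with $ works for URLs that end in ".js".
p bare_js.match?(anchored_lookbehind)   # => false
p page_url.match?(anchored_lookbehind)  # => true
```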
You can try with this pattern:
wtflungcancer.com\/(?>[^\s.]++|\.++(?!js))*(?!\.)
Explanations:
The goal is to allow all characters that are not a space or a dot followed by js:
(?> # open an atomic group
[^\s.]++ # all characters but white characters and .
| # OR
\.++(?!js) # . not followed by js
)* # close the atomic group, repeat zero or more times
To make sure the pattern checks the whole URL string, I added a lookahead that verifies no dot follows.
I'm looking for words starting with a hashtag: "#yolo"
My regex for this was very simple: /#\w+/
This worked fine until I hit words that ended with a question mark: "#yolo?".
I updated my regex to allow for words and any non whitespace character as well: /#[\w\S]*/.
The problem is I sometimes need to pull a match from a word starting with two '#' characters, up until whitespace, that may contain a special character in it or at the end of the word (which I need to capture).
Example:
"##yolo?"
And I would like to end up with:
"#yolo?"
Note: the regular expressions are for Ruby.
P.S. I'm testing these out here: http://rubular.com/
Maybe this would work
#(#?[\S]+)
What about
#[^#\s]+
\w is a subset of ^\s (i.e. \S) so you don't need both. Also, I assume you don't want any more #s in the match, so we use [^#\s] which negates both whitespace and # characters.
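A short sketch of both suggestions, assuming a made-up sentence around the "##yolo?" example from the question:

```ruby
# Hypothetical sample text; "##yolo?" is the case from the question.
text = "say ##yolo? and #swag"

# [^#\s]+ stops at whitespace and refuses further # characters,
# so the match starts at the second # of "##yolo?".
tags = text.scan(/#[^#\s]+/)
p tags  # => ["#yolo?", "#swag"]

# The capture-group variant returns the part after the first #.
captured = "##yolo?"[/#(#?[\S]+)/, 1]
p captured  # => "#yolo?"
```

Note that the capture-group variant would drop the # from a single-hash tag such as "#swag", since group 1 starts after the first #.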
My implementation of markdown turns double hyphens into endashes. E.g., a -- b becomes a – b
But sometimes users write a - b when they mean a -- b. I'd like a regular expression to fix this.
Obviously body.gsub(/ - /, " -- ") comes to mind, but this messes up markdown's unordered lists: if a line starts - list item, it will become -- list item. So the solution must only swap out hyphens when there is a word character somewhere to their left.
You can match a word character to the hyphen's left and use a backreference in the replacement string to put it back:
body.gsub(/(\w) - /, '\1 -- ')
Perhaps, if you want to be a little more accepting ...
gsub(/\b([ \t]+)-(?=[ \t]+)/, '\1--')
\b[ \t] forces a non-whitespace character before the whitespace through a word boundary condition. I don't use \s, to avoid matching across line breaks. I also only use one capture to preserve the preceding whitespace (does Ruby 1.8.x have a ?<= ?).
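Both replacements can be tried on a small sample. This is only a sketch; the sample strings are assumptions chosen to include a list item and a tab-separated hyphen:

```ruby
# Hypothetical markdown body: one inline hyphen and one list item.
body = "a - b\n- list item"

# Backreference version: the word character is captured and put back.
backref = body.gsub(/(\w) - /, '\1 -- ')

# Word-boundary version: the whitespace run is captured instead.
boundary = body.gsub(/\b([ \t]+)-(?=[ \t]+)/, '\1--')

p backref   # => "a -- b\n- list item"
p boundary  # => "a -- b\n- list item"

# The second pattern also accepts tabs around the hyphen.
tabbed = "x\t-\ty".gsub(/\b([ \t]+)-(?=[ \t]+)/, '\1--')
p tabbed  # => "x\t--\ty"
```

In both cases the list item on the second line is left alone, because no word character precedes its hyphen.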