I checked the documentation, and cannot find what [\w-] means. Can anyone tell me what [\w-] means in Ruby?
The square brackets [] denote a character class. A character class will match any of the things inside it.
\w is a special class called "word characters". It is shorthand for [a-zA-Z0-9_], so it will match:
a-z (all lowercase letters)
A-Z (all uppercase letters)
0-9 (all digits)
_ (an underscore)
The class you are asking about, [\w-], is a class consisting of \w and -. So it will match the above list, plus hyphens (-).
Exactly as written, [\w-], this regex would match a single character, as long as it's in the above list, or is a dash.
If you were to add a quantifier to the end, e.g. [\w-]* or [\w-]+, then it would match any of these strings:
fooBar9
foo-Bar9
foo-Bar-9
-foo-Bar---9abc__34ab12d
And it would partially match these:
foo,Bar9 # match 'foo' - the ',' stops the match
-foo-Bar---9*bc__34ab12d # match '-foo-Bar---9', the '*' stops the match
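The full and partial matches listed above can be checked with a short sketch. The variable names are only illustrative and are not from the original question.

```ruby
# [\w-]+ : one or more characters that are word characters or hyphens.
pattern = /[\w-]+/

samples = ["fooBar9", "foo-Bar-9", "foo,Bar9", "-foo-Bar---9*bc__34ab12d"]

# String#[] with a Regexp returns the first match (or nil if there is none).
first_matches = samples.map { |s| s[pattern] }

first_matches.each { |m| puts m }
# fooBar9
# foo-Bar-9
# foo
# -foo-Bar---9
```

The last two samples are only partially matched because ',' and '*' are neither word characters nor hyphens.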
\w Any word character (letter, number, underscore)
Here is what I think it is doing. Go to Rubular and try the following:
regex_1 /\w-/
String : f-oo
regex_1 will match only f- (a word character immediately followed by a hyphen) and stop there, ignoring the rest of the string, oo.
Whereas :
regex_2 /[\w-]/
string : f-oo
regex_2 will match every character of the string, including the special char -, i.e. all of f-oo.
I also tested a string like f-1oo. Since a digit (\d) is also matched by \w, the second regex matches every character there too.
==========
I believe the whole point of [] is to keep matching on either side of the -. Here are some variations I tried in irb.
irb(main):004:0> "blah-blah".scan(/\w-/)
=> ["h-"]
irb(main):005:0> "blah-blah".scan(/[\w-]/)
=> ["b", "l", "a", "h", "-", "b", "l", "a", "h"]
irb(main):006:0> "blah-blah".scan(/\w-\w/)
=> ["h-b"]
irb(main):007:0> "blah-blah".scan(/\w-\w*$/)
=> ["h-blah"]
irb(main):008:0> "blah-blah".scan(/\w*-\w*$/)
=> ["blah-blah"]
Related
I have a string
a="Tamilnadu is far away from Kashmir"
If I split this string on "Tamilnadu", I don't find "Tamilnadu" in the resulting array; I find an empty string there instead. If I split on "away", then "away" is not present in the result array either. What should I do to include the delimiter in the result instead of getting an empty string?
Example
a="Tamilnadu is far away from Kashmir"
p a.split("Tamilnadu")
The output is
["", " is far away from Kashmir"]
But I want
["Tamilnadu", " is far away from Kashmir"]
From docs:
If pattern is a Regexp, str is divided where the pattern matches. Whenever the pattern matches a zero-length string, str is split into individual characters. If pattern contains groups, the respective matches will be returned in the array as well.
So... to split by "Tamilnadu" and keep it in the list, make it a capture group:
"Tamilnadu is far away from Kashmir".split(/(Tamilnadu)/)
# => ["", "Tamilnadu", " is far away from Kashmir"]
or, if you want to split after "Tamilnadu", make a zero-width match after it using lookbehind:
"Tamilnadu is far away from Kashmir".split(/(?<=Tamilnadu)/)
# => ["Tamilnadu", " is far away from Kashmir"]
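Combining the capture-group split with a filter for empty strings gives the output the question asks for, wherever the delimiter sits. This is a minimal sketch; `split_keeping` is a hypothetical helper name, and `Regexp.escape` is used on the assumption that the delimiter should be treated as literal text.

```ruby
# Hypothetical helper: split on a literal delimiter, keep the delimiter,
# and drop the empty string that appears when the delimiter starts the string.
def split_keeping(str, delimiter)
  # The capture group makes String#split include the delimiter in the result.
  str.split(/(#{Regexp.escape(delimiter)})/).reject(&:empty?)
end

p split_keeping("Tamilnadu is far away from Kashmir", "Tamilnadu")
# ["Tamilnadu", " is far away from Kashmir"]
p split_keeping("Far away is Tamilnadu from Kashmir", "Tamilnadu")
# ["Far away is ", "Tamilnadu", " from Kashmir"]
```

Unlike the lookbehind version, this also separates the text before the delimiter from the delimiter itself.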
If you don't know where "Tamilnadu" is in the string but you want to split the string before and after it, and not have any empty strings in the resulting array, you can use String#scan:
def split_it(str, substring)
str.scan(/\A.+(?= #{substring}\b)|\b#{substring}\b|(?<=\b#{substring} ).+/)
end
substring = "Tamilnadu"
split_it("Tamilnadu is far away from Kashmir", substring)
#=> ["Tamilnadu", "is far away from Kashmir"]
split_it("Far away is Tamilnadu from Kashmir", substring)
#=> ["Far away is", "Tamilnadu", "from Kashmir"]
split_it("Far away from Kashmir is Tamilnadu", substring)
#=> ["Far away from Kashmir is", "Tamilnadu"]
split_it("Far away is Daluth from Kashmir", substring)
#=> []
split_it("Far away is Tamilnaduland from Kashmir", substring)
#=> []
I've assumed that substring appears at most once in the string.
The regular expression can be written in free-spacing mode to make it self-documenting:
substring = "Tamilnadu"
/
\A.+ # match the beginning of the string followed by > 0 characters
(?=\ #{substring}\b) # match the value of substring preceded by a space and
# followed by a word break, in a positive lookahead
| # or
\b#{substring}\b # match the value of substring with a word break before and after
| # or
(?<=\b#{substring}\ ) # match the value of substring preceded by a word break
# and followed by a space, in a positive lookbehind
.+ # match > 0 characters
/x # free-spacing regex definition mode
#=>
/
\A.+ # ...
(?=\ Tamilnadu\b) # ...
| # ...
\bTamilnadu\b # ...
| # ...
(?<=\bTamilnadu\ ) # ...
.+ # ...
/x
Free-spacing mode removes all spaces before the regex is parsed, including spaces that may be intended to be part of the expression. It was for that reason that I escaped the two spaces. I could alternatively put each in a character class ([ ]) or use \s, [[:space:]] or \p{Space}, though they match whitespace, which is not quite the same.
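The effect of free-spacing mode on literal spaces can be seen in a small sketch. The sample text and variable names are made up for illustration.

```ruby
text = "far away"

unescaped = /far away/x    # the space is ignored, so this behaves like /faraway/
escaped   = /far\ away/x   # a backslash-escaped space is kept
in_class  = /far[ ]away/x  # a space inside a character class is kept

p unescaped.match?(text)       # false
p unescaped.match?("faraway")  # true
p escaped.match?(text)         # true
p in_class.match?(text)        # true
```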
I'm having trouble splitting a character from a string using a regular expression, assuming there is a match.
I want to split off either an "m" or an "f" character from the first part of a string assuming the next character is one or more numbers followed by optional space characters, followed by a string from an array I have.
I tried:
2.4.0 :006 > MY_SEPARATOR_TOKENS = ["-", " to "]
=> ["-", " to "]
2.4.0 :008 > str = "M14-19"
=> "M14-19"
2.4.0 :011 > str.split(/^(m|f)\d+[[:space:]]*#{Regexp.union(MY_SEPARATOR_TOKENS)}/i)
=> ["", "M", "19"]
Notice the extraneous "" element at the beginning of my array and also notice that the last expression is just "19" whereas I would want everything else in the string ("14-19").
How do I adjust my regular expression so that only the parts of the expression that get split end up in the array?
I find match to be a bit more elegant when extracting characters from regular expressions in Ruby:
string = "M14-19"
string.match(/\A(?<m>[MF])(?<digits>\d{2}(-| to )\d{2})/)[1, 2]
=> ["M", "14-19"]
# the named captures can also be read from the match by name
extract_string = string.match(/\A(?<m>[MF])(?<digits>\d{2}(-| to )\d{2})/)
[extract_string[:m], extract_string[:digits]]
=> ["M", "14-19"]
string = 'M14 to 14'
extract_string = string.match(/\A(?<m>[MF])(?<digits>\d{2}(-| to )\d{2})/)[1, 2]
=> ["M", "14 to 14"]
TOKENS = ["-", " to "]
r = /
(?<=\A[mMfF]) # match the beginning of the string and then one
# of the 4 characters in a positive lookbehind
(?= # begin positive lookahead
\d+ # match one or more digits
[[:space:]]* # match zero or more spaces
(?:#{TOKENS.join('|')}) # match one of the tokens
) # close the positive lookahead
/x # free-spacing regex definition mode
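(?:#{TOKENS.join('|')}) is replaced by (?:-| to ). Note that in free-spacing mode the unescaped spaces in " to " are ignored, so the group effectively becomes (?:-|to); the examples below still work because [[:space:]]* consumes the space before "to".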
This can of course be written in the usual way.
r = /(?<=\A[mMfF])(?=\d+[[:space:]]*(?:#{TOKENS.join('|')}))/
When splitting on r you are splitting between two characters (between a positive lookbehind and a positive lookahead) so no characters are consumed.
"M14-19".split r
#=> ["M", "14-19"]
"M14 to 19".split r
#=> ["M", "14 to 19"]
"M14 To 19".split r
#=> ["M14 To 19"]
If it is desired that ["M", "14 To 19"] be returned in the last example, change [mMfF] to [mf] and /x to /xi.
You have a bug brewing in your code. Don't get in the habit of doing this:
#{Regexp.union(MY_SEPARATOR_TOKENS)}
You're setting yourself up with a very hard to debug problem.
Here's what's happening:
regex = Regexp.union(%w(a b)) # => /a|b/
/#{regex}/ # => /(?-mix:a|b)/
/#{regex.source}/ # => /a|b/
/(?-mix:a|b)/ is an embedded sub-pattern with its own set of the regex flags m, i and x, which are independent of the surrounding pattern's settings.
Consider this situation:
'CAT'[/#{regex}/i] # => nil
We'd expect the regular expression's i flag to make this match because it ignores case, but the sub-expression still allows only lowercase, causing the match to fail.
Using the bare (a|b) or adding source succeeds because the inner expression gets the main expression's i:
'CAT'[/(a|b)/i] # => "A"
'CAT'[/#{regex.source}/i] # => "A"
See "How to embed regular expressions in other regular expressions in Ruby" for additional discussion of this.
The empty element will always be there if you get a match, because the captured part appears at the beginning of the string and the string between the start of the string and the match is added to the resulting array, be it an empty or non-empty string. Either shift/drop it once you get a match, or just remove all empty array elements with .reject { |c| c.empty? } (see How do I remove blank elements from an array?).
Then, 14- is eaten up (consumed) by the \d+[[:space:]]... pattern part - put it into a (?=...) lookahead that will just check for the pattern match, but won't consume the characters.
Use something like
MY_SEPARATOR_TOKENS = ["-", " to "]
s = "M14-19"
p s.split(/^(m|f)(?=\d+[[:space:]]*#{Regexp.union(MY_SEPARATOR_TOKENS)})/i).drop(1)
#=> ["M", "14-19"]
See Ruby demo
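The difference between consuming the digits and only looking ahead at them can be seen side by side. This sketch simplifies the pattern to the "-" separator only, which is an assumption made to keep the example short.

```ruby
s = "M14-19"

# \d+- is part of the match, so "14-" is consumed and lost from the result.
consuming = s.split(/^(m|f)\d+-/i)
p consuming   # ["", "M", "19"]

# The lookahead only checks that digits and "-" follow; nothing is consumed.
lookahead = s.split(/^(m|f)(?=\d+-)/i)
p lookahead   # ["", "M", "14-19"]

# The leading "" is the (empty) text before the match; filter it out.
cleaned = lookahead.reject(&:empty?)
p cleaned     # ["M", "14-19"]
```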
I'm quite new to regular expressions. I am using the regular expression:
/\w+/
To check for words, and it's obvious that this will have problems with punctuation, but I'm not quite sure how to change this regular expression. For example, when I run this command from a class I made:
Wordify.new.regex(/\w+/).string("This sentence isn't 'the best-example, isn't it not?...").display
I get the output:
-----------
this: 1
sentence: 1
isn: 2
t: 2
the: 1
best: 1
example: 1
it: 1
not: 1
-----------
How can I adjust the regular expression so that it matches words with apostrophes, such as isn't, as one word, but matches only the when given 'the or the'? Hyphenated words like stack-overflow should be returned as stack and overflow separately, which this already does.
Additionally, words shouldn't be able to start or end with digits: test1241 or 436test should become test, but te7st is okay. Plain numbers should not be recognised.
Sorry, I know this is a big ask, but I'm not sure where to start with regex. Would be grateful if you could also explain what the expression means if possible.
str = "This is 2a' 4test' of my agréable re4'gex, n'est-ce pas?"
r = /
[[:alpha:]] # match a letter
(?: # begin the outer non-capture group
(?:[[:alpha:]]|\d|') # match a letter, digit or apostrophe in a non-capture group
* # execute the above non-capture group zero or more times
[[:alpha:]] # match a letter
)? # close the outer non-capture group and make it optional
/x # free-spacing regex definition mode
str.scan r
#=> ["This", "is", "a", "test", "of", "my", "agréable", "re4'gex", "n'est", "ce", "pas"]
Note that the outer non-capture group is made optional so that a single-letter word can still match.
Hmmm. Maybe we should add a hyphen to the inner non-capture group.
r = /[[:alpha:]](?:(?:[[:alpha:]]|\d|'|-)*[[:alpha:]])?/
str.scan r
#=> ["This", "is", "a", "test", "of", "my", "agréable", "re4'gex", "n'est-ce", "pas"]
I now rarely use the word-matching character \w, mainly because it matches the underscore, as well as letters and digits. Instead I reach for a POSIX bracket expression (search "POSIX"), which has the added (perhaps primary) benefit that it is not English-centric. For example, matching a word character with the exception of an underscore is [[:alnum:]].
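A short sketch of the difference between \w and the POSIX bracket expressions, using a made-up sample word. The accented letter is written as an escape only to avoid source-encoding issues.

```ruby
word = "agr\u00e9able_mot"   # "agréable_mot"

ascii_words = word.scan(/\w+/)           # \w is ASCII-only and includes "_"
alpha_runs  = word.scan(/[[:alpha:]]+/)  # Unicode-aware letters, no "_"
alnum_runs  = word.scan(/[[:alnum:]]+/)  # letters and digits, no "_"

p ascii_words  # ["agr", "able_mot"]
p alpha_runs   # ["agréable", "mot"]
p alnum_runs   # ["agréable", "mot"]
```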
You can do something basic using:
/[a-z]+(?:'[a-z]+)*/i
To extend it to allow words like a2b while avoiding leading or trailing digits (123abc, abc123) and plain numbers:
/[a-z]+(?:'[a-z]+|\d+[a-z]+)*/i
No special regex features are used in the two patterns, only basics.
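Here is the extended pattern applied with String#scan to a made-up sample covering the cases from the question.

```ruby
pattern = /[a-z]+(?:'[a-z]+|\d+[a-z]+)*/i

sample = "isn't 'the te7st test1241 436test 555"
words = sample.scan(pattern)

p words
# ["isn't", "the", "te7st", "test", "test"]
```

Leading and trailing digits are dropped because every match must start with a letter, and a run of digits is only accepted when more letters follow it.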
Try scanning the string using the [[:alpha:]] POSIX character class:
s = "This a sentence isn't 'the best-example, isn't it not?... a1 2b 3c3 d4d 555 stack-overflow"
s.scan(/[[:alpha:]](?:['\w]*[[:alpha:]])?/)
# => ["This", "a", "sentence", "isn't", "the", "best", "example", "isn't", "it", "not", "a", "b", "c", "d4d", "stack", "overflow"]
[First attempt]
I split the string into tokens separated by whitespace or hyphens then clean up each token per your rules, since it seems like they might be adjusted as you refine your problem:
def tokenize(str)
tokens = str.split(/(?:\s+|-)/)
tokens.reduce([]) do |memo, token|
token.gsub!(/(^\W+|\W+$)/, '') # Strip enclosing non-words
token.gsub!(/(^\d+|\d+$)/, '') # Strip enclosing digits
memo + (token=='' ? [] : [token]) # Ignore the empty string
end
end
s = "This sentence isn't 'the best-example, isn't it not?... a1 2b 3c3 d4d 555 stack-overflow"
puts tokenize(s).inspect
# ["This", "sentence", "isn't", "the", "best", "example", "isn't", "it", "not", "a", "b", "c", "d4d", "stack", "overflow"]
Clearly this solution doesn't use just regular expressions, but for my money it's much easier to understand and modify than what I imagine a big regex would look like!
I need help in understanding how the following works.
"middl'-.*$%ddlemiddlemiddlemiddlemiddlemiddlemiExcess".gsub(/[^a-zA-Z'-.]/, '')
# => "middl'-.*ddlemiddlemiddlemiddlemiddlemiddlemiExcess"
"middl'-.*$%ddlemiddlemiddlemiddlemiddlemiddlemiExcess".gsub(/[^a-zA-Z.'-]/, '')
# => "middl'-.ddlemiddlemiddlemiddlemiddlemiddlemiExcess"
With /[^a-zA-Z'-.]/ the asterisk is not removed, but in the second example it is. Why?
I want the result after gsub to contain only letters (a-zA-Z), periods (.), hyphens (-) and apostrophes ('). Why does just changing the position of the period inside the regular expression change the output?
In /[^a-zA-Z'-.]/ the hyphen is treated as a range delimiter, exactly as in A-Z before it. The range is:
▶ ("'"..'.').to_a
#⇒ ["'", "(", ")", "*", "+", ",", "-", "."] # note asterisk
In /[^a-zA-Z.'-]/ the hyphen is the last character in the class and hence is treated as a literal hyphen.
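The three placements of the hyphen can be compared directly. This sketch uses a shortened, made-up input; escaping the hyphen is shown as an additional option that the answer above does not mention.

```ruby
input = "a'-.*$%b"

# '-. is read as the range from ' to . which happens to include *
as_range = input.gsub(/[^a-zA-Z'-.]/, '')
# A hyphen at the end of the class is a literal hyphen.
at_end   = input.gsub(/[^a-zA-Z.'-]/, '')
# An escaped hyphen is literal wherever it appears in the class.
escaped  = input.gsub(/[^a-zA-Z'\-.]/, '')

p as_range  # "a'-.*b"
p at_end    # "a'-.b"
p escaped   # "a'-.b"
```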
I am trying to create a regex to find words that contain any vowel.
So far I have tried this:
/(.*?\S[aeiou].*?[\s|\.])/i
but I have not used regexes much, so it's not working properly.
For example, if I input "test is 1234 and sky fly test1234",
it should match test, is, and, test1234, but it shows
test, is, 1234 and
With other input, the output is different again.
Alternatively you can also do something like:
"test is 1234 and sky fly test1234".split.find_all { |a| a =~ /[aeiou]/ }
# => ["test", "is", "and", "test1234"]
You could use the below regex.
\S*[aeiou]\S*
\S* matches zero or more non-space characters.
or
\w*[aeiou]\w*
This will solve it:
\b\w*[aeiou]+\w*\b
https://www.debuggex.com/r/O-fU394iC5ErcSs7
or you can substitute \w by \S
\b\S*[aeiou]+\S*\b
https://www.debuggex.com/r/RNE6Y6q1q5yPJbe-
\b - a word boundary
\w - same as [_a-zA-Z0-9]
\S - a non-whitespace character
Try this:
\b\w*[aeiou]\w*\b
\b denotes a word boundary, so this regexp matches a word boundary, zero or more word characters, a vowel, zero or more word characters and another word boundary.
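Either pattern can be applied with String#scan to the sample input from the question. The i flag is added here on the assumption that uppercase vowels should count as well.

```ruby
text = "test is 1234 and sky fly test1234"

with_word_chars = text.scan(/\b\w*[aeiou]\w*\b/i)
with_non_space  = text.scan(/\S*[aeiou]\S*/i)

p with_word_chars  # ["test", "is", "and", "test1234"]
p with_non_space   # ["test", "is", "and", "test1234"]
```

The two differ on input with punctuation: \S also matches characters such as commas and apostrophes, while \w does not.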