I'm quite new to regular expressions. I am using the regular expression:
/\w+/
to check for words. It obviously has problems with punctuation, but I'm not quite sure how to change it. For example, when I run this command from a class I made:
Wordify.new.regex(/\w+/).string("This sentence isn't 'the best-example, isn't it not?...").display
I get the output:
-----------
this: 1
sentence: 1
isn: 2
t: 2
the: 1
best: 1
example: 1
it: 1
not: 1
-----------
How can I adjust the regular expression so that it matches words with apostrophes, such as isn't, as one word, but matches only the when given 'the or the'? Hyphenated words like stack-overflow should return stack and overflow separately, which this already does.
Additionally, words shouldn't start or end with digits: test1241 and 436test should become test, but te7st is okay. Plain numbers should not be recognised.
Sorry, I know this is a big ask, but I'm not sure where to start with regex. Would be grateful if you could also explain what the expression means if possible.
str = "This is 2a' 4test' of my agréable re4'gex, n'est-ce pas?"
r = /
[[:alpha:]] # match a letter
(?: # begin the outer non-capture group
(?:[[:alpha:]]|\d|') # match a letter, digit or apostrophe in a non-capture group
* # execute the above non-capture group zero or more times
[[:alpha:]] # match a letter
)? # close the outer non-capture group and make it optional
/x # free-spacing regex definition mode
str.scan r
#=> ["This", "is", "a", "test", "of", "my", "agréable", "re4'gex", "n'est", "ce", "pas"]
Note that the outer non-capture group is needed in case the word to be matched is a single character.
Hmmm. Maybe we should add a hyphen to the inner non-capture group.
r = /[[:alpha:]](?:(?:[[:alpha:]]|\d|'|-)*[[:alpha:]])?/
str.scan r
#=> ["This", "is", "a", "test", "of", "my", "agréable", "re4'gex", "n'est-ce", "pas"]
I now rarely use the word-matching character \w, mainly because it matches the underscore, as well as letters and digits. Instead I reach for a POSIX bracket expression (search "POSIX"), which has the added (perhaps primary) benefit that it is not English-centric. For example, matching a word character with the exception of an underscore is [[:alnum:]].
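A small sketch of that difference, assuming Ruby 1.9 or later with UTF-8 strings (the sample strings and variable names are made up for illustration). In Ruby, \w covers only ASCII letters, digits and the underscore, while the POSIX bracket expressions are Unicode-aware:

```ruby
# "agréable", written with an escape so the source file encoding does not matter
word = "agr\u00e9able"

ascii_runs = word.scan(/\w+/)                 # \w stops at the accented letter
alpha_runs = word.scan(/[[:alpha:]]+/)        # the POSIX class keeps the word whole
snake      = "foo_bar9".scan(/[[:alnum:]]+/)  # the underscore is not alphanumeric

p ascii_runs  #=> ["agr", "able"]
p alpha_runs  #=> ["agréable"]
p snake       #=> ["foo", "bar9"]
```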
You can do something basic using:
/[a-z]+(?:'[a-z]+)*/i
To extend it to allow words like a2b, while still dropping the digits from 123abc and abc123 and ignoring plain numbers:
/[a-z]+(?:'[a-z]+|\d+[a-z]+)*/i
There are no special regex features used in the two patterns, only basics.
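As a quick check, the two patterns can be run through String#scan. This is only a sketch: the first input is the sentence from the question, while the second list of test words and the variable names are assumptions added for illustration.

```ruby
basic    = /[a-z]+(?:'[a-z]+)*/i
extended = /[a-z]+(?:'[a-z]+|\d+[a-z]+)*/i

sentence = "This sentence isn't 'the best-example, isn't it not?..."
samples  = "a2b 123abc abc123 555 te7st"   # made-up test words

# An apostrophe or a digit run is only accepted when letters follow it.
basic_words    = sentence.scan(basic)
extended_words = samples.scan(extended)

p basic_words     #=> ["This", "sentence", "isn't", "the", "best", "example", "isn't", "it", "not"]
p extended_words  #=> ["a2b", "abc", "abc", "te7st"]
```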
Try scanning the string using the [[:alpha:]] POSIX character class:
s = "This a sentence isn't 'the best-example, isn't it not?... a1 2b 3c3 d4d 555 stack-overflow"
s.scan(/[[:alpha:]](?:['\w]*[[:alpha:]])?/)
# => ["This", "a", "sentence", "isn't", "the", "best", "example", "isn't", "it", "not", "a", "b", "c", "d4d", "stack", "overflow"]
[First attempt]
I split the string into tokens separated by whitespace or hyphens then clean up each token per your rules, since it seems like they might be adjusted as you refine your problem:
def tokenize(str)
  tokens = str.split(/(?:\s+|-)/)
  tokens.reduce([]) do |memo, token|
    token.gsub!(/(^\W+|\W+$)/, '') # Strip enclosing non-words
    token.gsub!(/(^\d+|\d+$)/, '') # Strip enclosing digits
    memo + (token=='' ? [] : [token]) # Ignore the empty string
  end
end
s = "This sentence isn't 'the best-example, isn't it not?... a1 2b 3c3 d4d 555 stack-overflow"
puts tokenize(s).inspect
# ["This", "sentence", "isn't", "the", "best", "example", "isn't", "it", "not", "a", "b", "c", "d4d", "stack", "overflow"]
Clearly this solution doesn't use just regular expressions, but for my money it's much easier to understand and modify than what I imagine a big regex would look like!
Related
If I have a string that's a sentence, I want to check if the first and last letter of each word are the same and find which of the words have their first and last letter the same. For example:
sentence_one = "Label the bib numbers in red."
You could use a regex:
sentence_one = "Label the bib numbers in red"
sentence_one.scan(/(\b(\w)\w*(\2)\b)/i)
#=> [["Label", "L", "l"], ["bib", "b", "b"]]
\b is a word boundary, \w matches a word character, i.e. a letter, digit or underscore (you may have to adjust this). There are 3 captures: (1) the whole word, (2) the first letter and (3) the last letter. Using \2 requires the last letter to match the first.
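If only the words themselves are wanted, the first capture of each match can be picked out. A minimal sketch, assuming a slightly adapted sample sentence and hypothetical variable names; the third capture group is dropped here because \2 alone already enforces the condition:

```ruby
sentence = "Label a bib with numbers in red"   # sample sentence, adapted for illustration

# Group 1 is the whole word, group 2 its first character; \2 forces the last one to equal it.
matches = sentence.scan(/(\b(\w)\w*\2\b)/i)
words   = matches.map(&:first)   # keep only the whole-word capture

p words  #=> ["Label", "bib"]
```

A one-letter word such as "a" is not matched, because the pattern needs at least two characters.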
This will print out all words that start with and end with the same letter (not case-sensitive)
sentence_one = "Label the bib numbers in red"
words = sentence_one.split(' ')
words.each do |word|
  if word[0].downcase == word[-1].downcase
    puts word
  end
end
sentence_one.scan(/\S+/).select{|s| s[0].downcase == s[-1].downcase}
# => ["Label", "bib"]
In a comment the OP asked how one could obtain a count of words having the desired property. I assume that the desired property is that a word's first and last characters are the same, though possibly of different case. Here is one way to do that which does not produce an intermediate array whose elements would then be counted.
r = /
\b # match a word break
(?: # begin a non-capture group
\p{Alpha} # match a letter
| # or
(\p{Alpha}) # match a letter in capture group 1
\p{Alpha}* # match zero or more letters
\1 # match the contents of capture group 1
) # end the non-capture group
\b # match a word break
/ix # case-indifferent and free-spacing regex definition modes
str = "How, now is that a brown cow?"
str.gsub(r).count
#=> 2
See String#gsub, in particular the case where there is only one argument and no block is provided.
Note
str.gsub(r).to_a
#=> ["that", "a"]
str.scan(r)
#=> [["t"], [nil]]
Sometimes it is awkward to use scan when the regular expression contains capture groups (see String#scan). Those problems often can be avoided by instead using gsub followed by to_a (or Enumerable#entries).
Just to add one more option, splitting into an array (and skipping one-letter words):
sentence_one = "Label the bib numbers in a red color"
sentence_one.split(' ').keep_if{ |w| w.end_with?(w[0].downcase) & (w.size > 1) }
#=> ["Label", "bib"]
sentence_one = "Label the bib numbers in red"
puts sentence_one.split(' ').count{|word| word[0] == word[-1]} # => 1
I'm trying to parse a subset of a webpage with regex, just for fun. It was fun until I encountered the following problem. I have a paragraph like the one below:
foo: 1, 2, 3, 4 and 5.
bar: 1, 2 and 3.
What I am trying to do is get the numbers in the first line of the paragraph, the one starting with foo:, by applying the following regex:
foo:(?:\s(\d)(?:,|\sand|\.))+
This matches the above string, but it captures only the last occurrence of the capture group, which is 5.
How can I capture all the numbers in the paragraph starting with foo:, up to the first occurrence of ., using a single regex pattern?
In most programming languages a repeated capturing group keeps only its last match; the earlier ones aren't stored separately, so you can't refer to them individually. This is a valid reason to use the \G anchor. \G matches at the position where the previous match ended, or at the beginning of the string (the same as \A) when there is no previous match.
So we need the first of these two behaviours:
(?:foo:|\G(?!\A))\s*(\d+)\s*(?:,|and)?
Breakdown:
(?: Start a non-capturing group
foo: Match foo:
| Or
\G(?!\A) Continue match from where previous match ends
) End of NCG
\s* Any number of whitespace characters
(\d+) Match and capture digits
\s* Any number of whitespace characters
(?:,|and)? Optional , or and
This regex begins a match on meeting foo: in the input string. Each following match then looks for digits, optionally followed by a comma or and (whitespace is allowed around the digits).
The \K token resets the match: it tells the engine to forget whatever has been matched so far (but keep whatever has been captured) and leaves the cursor at that position.
I used \K in the Rubular regex so that the result set would contain only the captured digits rather than the whole matched strings. However, Rubular seems to work differently and didn't need \K, so it's not a must at all.
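In Ruby, the \G pattern above can be tried with String#scan. This is a sketch under the assumption that the paragraph is held in a single string (the variable names are hypothetical); scan returns one array of captures per match, so the result is flattened:

```ruby
text = "foo: 1, 2, 3, 4 and 5.\nbar: 1, 2 and 3."

# Start at "foo:" or continue from the end of the previous match (\G),
# but never at the very beginning of the string.
pattern = /(?:foo:|\G(?!\A))\s*(\d+)\s*(?:,|and)?/

numbers = text.scan(pattern).flatten   # one [capture] array per match, flattened

p numbers  #=> ["1", "2", "3", "4", "5"]
```

The chain of matches breaks at the first period, since neither \s* nor \d+ can consume it, so the numbers on the bar: line are never reached.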
This answer uses just one regex, but admittedly does a bit of pre- and post-processing. (Please allow me a bit of fun. I do think there may be some instructional value here.)
str = "foo: 1, 2, 34, 4 and 5. and 6."
r = /
\d+ # match one or more digits
(?=[^.]+:oof\z) # match one or more characters other than a period, followed
# by ":oof" at the end of the string, in a positive lookahead
/x # free-spacing regex definition mode
str.reverse.scan(r).join(' ').reverse.split
#=> ["1", "2", "34", "4", "5"]
The steps are as follows.
s = str.reverse
#=> ".6 dna .5 dna 4 ,43 ,2 ,1 :oof"
a = s.scan r
#=> ["5", "4", "43", "2", "1"]
b = a.join(' ')
#=> "5 4 43 2 1"
c = b.reverse
#=> "1 2 34 4 5"
c.split
#=> ["1", "2", "34", "4", "5"]
An empty array is returned if there is no match.
So, why all the reversing? It's to allow me to use a positive lookahead, which, unlike a positive lookbehind, permits variable-length matches.
I checked the documentation, and cannot find what [\w-] means. Can anyone tell me what [\w-] means in Ruby?
The square brackets [] denote a character class. A character class will match any of the things inside it.
\w is a special class called "word characters". It is shorthand for [a-zA-Z0-9_], so it will match:
a-z (all lowercase letters)
A-Z (all uppercase letters)
0-9 (all digits)
_ (an underscore)
The class you are asking about, [\w-], is a class consisting of \w and -. So it will match the above list, plus hyphens (-).
Exactly as written, [\w-], this regex would match a single character, as long as it's in the above list, or is a dash.
If you were to add a quantifier to the end, e.g. [\w-]* or [\w-]+, then it would match any of these strings:
fooBar9
foo-Bar9
foo-Bar-9
-foo-Bar---9abc__34ab12d
And it would partially match these:
foo,Bar9 # match 'foo' - the ',' stops the match
-foo-Bar---9*bc__34ab12d # match '-foo-Bar---9', the '*' stops the match
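The full and partial matches listed above can be reproduced with String#[], which returns the first match of a pattern. A minimal sketch (the variable names are only for illustration):

```ruby
pattern = /[\w-]+/   # one or more word characters or hyphens

full      = "foo-Bar-9"[pattern]
partial_1 = "foo,Bar9"[pattern]                   # the ',' stops the match
partial_2 = "-foo-Bar---9*bc__34ab12d"[pattern]   # the '*' stops the match

p full       #=> "foo-Bar-9"
p partial_1  #=> "foo"
p partial_2  #=> "-foo-Bar---9"
```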
\w Any word character (letter, number, underscore)
Here is what I think it is doing. Go to Rubular and try the following:
regex_1 /\w-/
String : f-oo
regex_1 matches only f- and stops right after the -, ignoring the rest of the string (oo).
Whereas:
regex_2 /[\w-]/
String : f-oo
regex_2 matches every character of the string, including the special char -, so all of f-oo is highlighted.
.. Also , tested the case of a string being like f-1oo , and the second regex stopped the match at f- Hence, - is followed by a \d
==========
I believe the whole point of [] is to continue matching before and after -. Here are some variations I tried from irb.
irb(main):004:0> "blah-blah".scan(/\w-/)
=> ["h-"]
irb(main):005:0> "blah-blah".scan(/[\w-]/)
=> ["b", "l", "a", "h", "-", "b", "l", "a", "h"]
irb(main):006:0> "blah-blah".scan(/\w-\w/)
=> ["h-b"]
irb(main):007:0> "blah-blah".scan(/\w-\w*$/)
=> ["h-blah"]
irb(main):008:0> "blah-blah".scan(/\w*-\w*$/)
=> ["blah-blah"]
I can't seem to get a regex that matches either a hashtag #, an @, or a word-boundary. The goal is to break a string into Twitter-like entities and topics so:
input = "Hello #world, #ruby anotherString"
input.scan(entitiesRegex)
# => ["Hello", "#world", "#ruby", "anotherString"]
To get just the words, excluding "anotherString" which is too large, is simple:
/\b\w{3,12}\b/
will return ["Hello", "world", "ruby"]. Unfortunately this doesn't include the hashtags and @s. It seems like it should work simply with:
/[\b@#]\w{3,12}\b/
but that returns ["#world", "#ruby"]. This made me realize that word boundaries are not by definition a character, so they don't fall into the category of "A single character" and, so, won't match. A few more attempts:
/\b|[@#]\w{3,12}\b/
returns ["", "", "#world", "", "#ruby", "", "", ""].
/(\b|[@#])\w{3,12}\b/
matches the right things, but returns [[""], ["#"], ["#"]] as expected, because the parentheses also mean capture everything enclosed.
/((\b|[@#])\w{3,12}\b)/
kind of works. It returns [["Hello", ""], ["#world", "#"], ["#ruby", "#"]]. So now all the correct items are there, they're just located at the first element of each of the subarrays. The following snippet technically works:
input.scan(/((\b|[@#])\w{3,12}\b)/).collect(&:first)
Is it possible to simplify this to match and return the correct substrings with just the regular expression not requiring the collect post-processing?
You can just use the regular expression /[@#]?\b\w+\b/. That is, optionally match a @ or #, followed by a word boundary (in #ruby, that boundary would be between # and ruby; in a normal word it would also match at the start of the word) and a bunch of word characters.
p "Hello #world, #ruby anotherString".scan(/[@#]?\b\w+\b/)
# => ["Hello", "#world", "#ruby", "anotherString"]
Furthermore, you can adjust the number of characters a matching word should have with quantifiers. You gave an example in a comment to a deleted answer to match only #ruby by using {3,4}:
p "Hello #world, #ruby anotherString".scan(/[@#]?\b\w{3,4}\b/)
# => ["#ruby"]
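If the alternation from the question is preferred, a non-capturing group is another way to avoid the collect step, because scan returns whole matches when the pattern has no capture groups. A sketch, assuming the character class is meant to cover @ and # and keeping the question's 3-to-12 length limit:

```ruby
input = "Hello #world, #ruby anotherString"

# (?: ... ) groups without capturing, so scan returns a flat array of whole matches.
entities = input.scan(/(?:\b|[@#])\w{3,12}\b/)

p entities  #=> ["Hello", "#world", "#ruby"]
```

"anotherString" is left out because it has 13 word characters, one more than the limit allows.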
I just learned a bit about \B and \b, and accordingly tried some code (taken from the internet), but couldn't understand how the output is generated by those regexp anchors. Could anyone please help me understand the difference between \B and \b by explaining how they work internally during pattern matching in Ruby?
Interactive ruby ready.
> str = "Hit him on the head\n" +
"Hit him on the head with a 2×4\n"
=> "Hit him on the head
Hit him on the head with a 2×4
"
> str.scan(/\w+\B/)
=> ["Hi", "hi", "o", "th", "hea", "Hi", "hi", "o", "th", "hea", "wit"]
> str.scan(/\w+\b/)
=> ["Hit", "him", "on", "the", "head", "Hit", "him", "on", "the", "head", "with", "a", "2", "4"]
>
Thanks,
Like most lower/upper case pairs, they are exact opposites:
\b matches a word boundary – that is, it matches between two characters (it’s a zero-width match, i.e. it doesn’t consume a character when matching) where one belongs to a word and the other doesn’t, and also at the start or end of the string when a word character is there. In the text “this person”, \b would match the following positions (denoted by a vertical bar): “|this| |person|”.
\B matches anywhere but at a word boundary. It would match at these positions: “t|h|i|s p|e|r|s|o|n” – that is, between all letters, but not between a letter and a non-letter character.
So if you have \w+\b and match “this person” then you get as a result “this” because + is greedy and matches as many word characters (\w) as possible, up to the next word boundary.
\w+\B operates similarly, but it cannot match “this” since that is followed by a word boundary, which \B forbids. So the engine backtracks one character and matches “thi” instead.
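The positions described above can be made visible in Ruby by substituting a vertical bar at every zero-width match. A small sketch using the same sample text (the variable names are only for illustration):

```ruby
text = "this person"

boundaries     = text.gsub(/\b/, "|")   # mark every word boundary
non_boundaries = text.gsub(/\B/, "|")   # mark every position that is not a boundary

first_b       = text[/\w+\b/]   # greedy run up to the word boundary
first_capital = text[/\w+\B/]   # backtracks one character to avoid the boundary

p boundaries      #=> "|this| |person|"
p non_boundaries  #=> "t|h|i|s p|e|r|s|o|n"
p first_b         #=> "this"
p first_capital   #=> "thi"
```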