I'm trying to parse a subset of a webpage with regex for just fun. It was fun till I encountered with the following problem. I have a paragraph like below;
foo: 1, 2, 3, 4 and 5.
bar: 1, 2 and 3.
What I am trying to do is, get the numbers in the first line of the paragraph starting with foo: by applying following regex:
foo:(?:\s(\d)(?:,|\sand|\.))+
This matches with the above string but it captures only the last occurrence of the capture group which is 5.
How can I capture all the numbers in a paragraph starting with foo: till the first occurrence of . using single regex pattern.
Repeating capturing group's data aren't stored separately in most programming languages, hence you can't refer to them individually. This is a valid reason to use \G anchor. \G causes a match to start from where previous match ended or it will match beginning of string as same as \A.
So we are in need of its first capability:
(?:foo:|\G(?!\A))\s*(\d+)\s*(?:,|and)?
Breakdown:
(?: Start a non-capturing group
foo: Match foo:
| Or
\G(?!\A) Continue match from where previous match ends
) End of NCG
\s* Any number of whitespace characters
(\d+) Match and capture digits
\s* Any number of whitespae characters
(?:,|and)? Optional , or and
This regex will begin a match on meeting foo in input string. Then tries to find a following digit that precedes a comma or and (whitespaces are allowed around digits).
\K token will reset match. It means it will send a signal to engine to forget whatever is matched so far (but keep whatever is captured) and then leaves cursor right at that position.
I used \K in Rubular regex to make result set not to have matched strings but captured digits. However Rubular seems to work differently and didn't need \K. It's not a must at all.
This answer uses just one regex, but admittedly does a bit of pre- and post-processing. (Please allow me a bit of fun. I do think there may be some instructional value here.)
str = "foo: 1, 2, 34, 4 and 5. and 6."
r = /
\d+ # match one or more digits
(?=[^.]+:oof\z) # match one or more digits other than a period, followed
# by ":oof" at the end of the string, in a positive lookahead
/x # free-spacing regex definition mode
str.reverse.scan(r).join(' ').reverse.split
#=> ["1", "2", "34", "4", "5"]
The steps are as follows.
s = str.reverse
#=> ".6 dna .5 dna 4 ,43 ,2 ,1 :oof"
a = s.scan r
#=> ["5", "4", "43", "2", "1"]
b = a.join(' ')
#=> "5 4 43 2 1"
c = b.reverse
#=> "1 2 34 4 5"
c.split
#=> ["1", "2", "34", "4", "5"]
An empty array is returned if there is no match.
So, why all the reversing? It's to allow me to use a positive lookahead, which, unlike a positive lookbehind, permits variable-length matches.
Related
If I have a string that's a sentence, I want to check if the first and last letter of each word are the same and find which of the words have their first and last letter the same. For example:
sentence_one = "Label the bib numbers in red."
You could use a regex:
sentence_one = "Label the bib numbers in red"
sentence_one.scan(/(\b(\w)\w*(\2)\b)/i)
#=> [["Label", "L", "l"], ["bib", "b", "b"]]
\b is a word boundary, \w matches a letter (you may have to adjust this). There are 3 captures: (1) the whole word, (2) the first letter and (3) the last letter. Using \2 requires the last letter to match the first.
This will print out all words that start with and end with the same letter (not case-sensitive)
sentence_one = "Label the bib numbers in red"
words = sentence_one.split(' ')
words.each do |word|
if word[0].downcase == word[-1].downcase
puts word
end
end
sentence_one.scan(/\S+/).select{|s| s[0].downcase == s[-1].downcase}
# => ["Label", "bib"]
In a comment the OP asked how one could obtain a count of words having the desired property. Here's one way to do that. I assume that the desired property is that a word's first and last characters are the same, though possibly of different case. Here is a way to do that that does not produce an intermediate array whose elements would be counted.
r = /
\b # match a word break
(?: # begin a non-capture group
\p{Alpha} # match a letter
| # or
(\p{Alpha}) # match a letter in capture group 1
\p{Alpha}* # match zero or more letters
\1 # match the contents of capture group 1
) # end the non-capture group
\b # match a word break
/ix # case-indifferent and free-spacing regex definition modes
str = "How, now is that a brown cow?"
str.gsub(r).count
#=> 2
See String#gsub, in particular the case where there is only one argument and no block is provided.
Note
str.gsub(r).to_a
#=> ["that", "a"]
str.scan(r)
#=> [["t"], [nil]]
Sometimes it is awkward to use scan when the regular expression contains capture groups (see String#scan). Those problems often can be avoided by instead using gsub followed by to_a (or Enumerable#entries).
Just to add one option more splitting to array (skipping one letter words):
sentence_one = "Label the bib numbers in a red color"
sentence_one.split(' ').keep_if{ |w| w.end_with?(w[0].downcase) & (w.size > 1) }
#=> ["Label", "bib"]
sentence_one = "Label the bib numbers in red"
puts sentence_one.split(' ').count{|word| word[0] == word[-1]} # => 1
I'm quite new to regular expressions. I am using the regular expression:
/\w+/
To check for words, and it's obvious that this will have problems with punctuation, but I'm not quite sure how to change this regular expression. For example, when I run this command from a class I made:
Wordify.new.regex(/\w+/).string("This sentence isn't 'the best-example, isn't it not?...").display
I get the output:
-----------
this: 1
sentence: 1
isn: 2
t: 2
the: 1
best: 1
example: 1
it: 1
not: 1
-----------
How can I adjust the regular expression so that it matches words with apostrophes, like: isn't as one word, but will only match the when searching 'the or the'. Hyphens in the middle of a word like stack-overflow should match return stack and overflow separately, which this already does.
Additionally, words shouldn't be able to start or end with numbers, like test1241 or 436test should become test, but te7st is okay. Plain numbers should not be recognised.
Sorry, I know this is a big ask, but I'm not sure where to start with regex. Would be grateful if you could also explain what the expression means if possible.
str = "This is 2a' 4test' of my agréable re4'gex, n'est-ce pas?"
r = /
[[:alpha:]] # match a letter
(?: # begin the outer non-capture group
(?:[[:alpha:]]|\d|') # match a letter, digit or apostrophe in a non-capture group
* # execute the above non-capture group zero or more times
[[:alpha:]] # match a letter
)? # close the outer non-capture group and make it optional
/x # free-spacing regex definition mode
str.scan r
#=> ["This", "is", "a", "test", "of", "my", "agréable", "re4'gex", "n'est", "ce", "pas"]
Note the outer capture group is needed in case the string to be matched is a single character.
Hmmm. Maybe we should add a hyphen to the inner non-capture group.
r = /[[:alpha:]](?:(?:[[:alpha:]]|\d|'|-)*[[:alpha:]])?/
str.scan r
#=> ["This", "is", "a", "test", "of", "my", "agréable", "re4'gex", "n'est-ce", "pas"]
I now rarely use the word-matching character \w, mainly because it matches the underscore, as well as letters and digits. Instead I reach for a POSIX bracket expression (search "POSIX"), which has the added (perhaps primary) benefit that it is not English-centric. For example, matching a word character with the exception of an underscore is [[:alnum:]].
You can do something basic using:
/[a-z]+(?:'[a-z]+)*/i
To extend it to allow words like a2b and avoid 123abc abc123 and or plain numbers:
/[a-z]+(?:'[a-z]+|\d+[a-z]+)*/i
There's no special regex features used in the two patterns, only basics.
Try scanning the string using the [[:alpha:]] POSIX character class:
s = "This a sentence isn't 'the best-example, isn't it not?... a1 2b 3c3 d4d 555 stack-overflow"
s.scan(/[[:alpha:]](?:['\w]*[[:alpha:]])?/)
# => ["This", "a", "sentence", "isn't", "the", "best", "example", "isn't", "it", "not", "a", "b", "c", "d4d", "stack", "overflow"]
[First attempt]
I split the string into tokens separated by whitespace or hyphens then clean up each token per your rules, since it seems like they might be adjusted as you refine your problem:
def tokenize(str)
tokens = str.split(/(?:\s+|-)/)
tokens.reduce([]) do |memo, token|
token.gsub!(/(^\W+|\W+$)/, '') # Strip enclosing non-words
token.gsub!(/(^\d+|\d+$)/, '') # Strip enclosing digits
memo + (token=='' ? [] : [token]) # Ignore the empty string
end
end
s = "This sentence isn't 'the best-example, isn't it not?... a1 2b 3c3 d4d 555 stack-overflow"
puts tokenize(s).inspect
# ["This", "sentence", "isn't", "the", "best", "example", "isn't", "it", "not", "a", "b", "c", "d4d", "stack", "overflow"]
Clearly this solution doesn't use just regular expressions but for my money it's much easier to understand and modify then (what I imagine) a big regex would look like!
My string is: "1 !2 3". I need to match all numbers except !2. I tried /\b\d{1,5}\b/, but it still matches !2. The \b anchor works well with words, but not digits.
What is the regex to solve my problem?
You need a negative lookbehind (?<!!) and use the word boundaries around \d+ (to exclude partial matches on 2+ digit numbers):
"1 !2 3".scan(/(?<!!)\b\d+\b/)
See IDEONE demo and a regex demo here. If you really plan to match numbers consisting of 1 to 5 digits, replace + quantifier (1 or more occurrences) with your {1,5} limiting quantifier.
The (?<!!) fails the match if a digit is preceded with an exclamation mark. The word boundaries require a non-word character on both sides of the digit chunks matched with \d+. As a ! is a non-word character (i.e. it belongs to the [^A-Za-z0-9_] character range), it is allowed if you just use a word boundary - that is why your regex did not work. Adding the lookbehind solves the issue.
You could use a regex that doesn't have a lookbehind:
r = /
\s*!\d+\s* # match >= 0 spaces, an exclamation mark, > 0 digits, >= 0 spaces
| # or
\s+ # match > 0 spaces
/x # free-spacing regex definition mode
"1 !2 3".split(r)
#=> ["1", "3"]
or two regexes:
"1 !2 3".gsub(/!\d+/, "").scan(/\d+/)
#=> ["1", "3"]
or no regexes:
"1 !2 3".split.reject { |s| s.start_with?("!") }
#=> ["1", "3"]
This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 7 years ago.
I found and tested a regex to validate a time string such as 11:30 AM:
^(1[0-2]|0?[1-9]):([0-5][0-9])(\s[A|P]M)\)?$
I understand most of it except the beginning:
(1[0-2]|0?[1-9])
Can someone explain what is going on? 1[0-2] is there is a fist digit that can be between 0 and 2? And then I don't understand |0?[1-9].
(1[0-2]|0?[1-9])
| separates the regex into two parts, where
1[0-2]
matches 10, 11 or 12, and
0?[1-9]
matches 1 to 9, with an optional leading 0.
I will explain by writing the regex in extended mode, which permits comments:
r = /
^ # match the beginning of the string
( # begin capture group 1
1 # match 1
[0-2] # match one of the characters 0,1,2
| # or
0? # optionally match a zero
[1-9] # match one of the characters between 1 and 9
) # end capture group 1
: # match a colon
( # begin capture group 2
[0-5] # match one of the characters between 0 and 5
[0-9] # match one of the characters between 0 and 9
) # end capture group 2
( # begin capture group 3
\s # match one whitespace character
[A|P] # match one of the characters A, | or P
M # match M
) # end capture group 3
\)? # optionally match a right parenthesis
$ # match the end of the string
/x # extended mode
As noticed by #Mischa, [A|P] is incorrect. It should be [AP]. That's because "|" is just an ordinary character when it's within a character class.
Also, I think the regex would be improved by moving \s out of capture group 3. We therefore might write:
r = /^(1[0-2]|0?[1-9]):([0-5][0-9])\s([AP]M)\)?$/
It could be used thusly:
result = "11:39 PM" =~ r
if result
puts "It's #{$2} minutes past #{$1}, #{ $3=='AM' ? 'anti' : 'post' } meridiem."
else
# raise exception?
end
#=> It's 39 minutes past 11, post meridiem.
In words, the revised regex reads as follows:
match the beginning of the string.
match "10", "11", "12", or one of the digits "1" to "9", optionally preceded by a zero, and save the match to capture group 1.
match a colon.
match a digit between "0" and "5", then a digit between "0" and "9", and save the two digits to capture group 2.
match a whitespace character.
match "A", or "P", followed by "M", and save the two characters to capture group 3.
optionally match a right parenthesis.
match the end of the string.
This code splits a word into two strings at the first vowel. Why?
word = "banana"
parts = word.split(/([aeiou].*)/)
The key here is the regular expression (or regex) that is being used between the two /'s
[aeiou] says to look for the first instance of one of those characters.
. matches any single character
* modifies the previous thing to mean match 0 or more of it
(...) means capture everything enclosed between the parentheses
Translated to english this regular expression might read something like "Given a string, find the first vowel that is followed by zero or more characters. Collect that vowel and its following characters and set them aside."
The slightly more confusing part is the regex's interaction with the split method. The value the regex returns is 'anana'. And we can see that calling split with 'anana' doesn't have the same result:
'banana'.split('anana') #=> ["b"]
But when split is called with a regular expression that uses a capture group - or parentheses (...), then anything in that capture group will also be returned in the result of the split. Which is why:
'banana'.split /([aeiou].*)/ #=> ["b", "anana"]
If you want to learn more about how regular expressions work (particularly in ruby), Rubular is a great resource to fiddle with - http://www.rubular.com/r/XEUgPhOdlH
This is actually a bit tricky. This regexp
/[aeiou].*/
matches the string from the first vowel to the end of the string i.e. "anana". But if you were to split on that, you would only get the first letter since split doesn't include the splitting pattern:
"banana".split /[aeiou].*/
# ["b"]
But according to the String#split docs, if the splitting pattern is a regexp with a capture group, the capture groups are included in the result as well. Since the whole pattern is wrapped in a capture group, the result is that the string splits before the first vowel.
For example, if you change the regexp to have two capture groups, it splits further:
"banana".split /([aeiou])(.*)/
# ["b", "a", "nana"]
ANSWER FOR OLD TITLE
It's not really a Ruby's syntax, it's a standard Regular Expression's syntax that also implemented by Ruby.
* means zero or more of previous item
. means any character
[aeiou] means any character inside the brace
() means capture it
So that regex means: capture anything that starts with a, e, i, o, or u.
the word.split(/([aeiou].*)/) means, split the word variable based on anything that starts with letter a, e, i, o, or u.
See here fore more information.
ANSWER FOR NEW TITLE
Why does it split on the first vowel? It's not really like that.. What it does is, split by anything that start with vowels and capture it (the string that starts with vowels) also, see more example here:
word = 'banana'
word.split /[aeiou]/ # split by vowels
#=> ["b", "n", "n"]
word.split /([aeiou])/ # split by vowels and capture the vowels
#=> ["b", "a", "n", "a", "n", "a"]
word.split /[aeiou].*/ # split by anything that start with vowels
#=> ["b"]
word.split /([aeiou].*)/ # split by anything that start with vowels and capture the thing that start with vowels also
#=> ["b", "anana"]
ANSWER FOR OLD TITLE
If the * symbol not inside the regular expression // (Ruby's syntax), there are some possibilities:
multiplication 2 * 3 == 6, 'na' * 3 == 'nanana' # batman!
splat operation [*(1..4)] == [1,2,3,4], see more info here