Regex to match `99` but not `!99` - ruby

My string is: "1 !2 3". I need to match all numbers except !2. I tried /\b\d{1,5}\b/, but it still matches !2. The \b anchor works well with words, but not digits.
What is the regex to solve my problem?

You need a negative lookbehind, (?<!!), together with word boundaries around \d+ (to exclude partial matches inside numbers of two or more digits):
"1 !2 3".scan(/(?<!!)\b\d+\b/)
If you really plan to match numbers consisting of 1 to 5 digits, replace the + quantifier (1 or more occurrences) with your {1,5} limiting quantifier.
The (?<!!) fails the match if a digit is preceded by an exclamation mark. The word boundaries require that the digit chunk matched with \d+ is not adjacent to another word character on either side (a non-word character, or the start or end of the string, is fine). Since ! is a non-word character (i.e. it belongs to the [^A-Za-z0-9_] character class), a word boundary alone still allows it, which is why your regex did not work. Adding the lookbehind solves the issue.
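A quick way to see the difference is to run the variants side by side. This is only a sketch; the longer sample string is made up for illustration and is not from the question.

```ruby
str = "1 !2 3 !45 678"

# Word boundaries alone still match digits that follow "!",
# because "!" is a non-word character.
boundary_only = str.scan(/\b\d{1,5}\b/)

# The lookbehind alone rejects "4" in "!45" but then matches the "5".
lookbehind_only = str.scan(/(?<!!)\d+/)

# Lookbehind plus word boundaries skips the whole "!45".
both = str.scan(/(?<!!)\b\d{1,5}\b/)

p boundary_only   #=> ["1", "2", "3", "45", "678"]
p lookbehind_only #=> ["1", "3", "5", "678"]
p both            #=> ["1", "3", "678"]
```

The middle variant is why the word boundaries are still worth keeping once the lookbehind is added.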

You could use a regex that doesn't have a lookbehind:
r = /
\s*!\d+\s* # match >= 0 spaces, an exclamation mark, > 0 digits, >= 0 spaces
| # or
\s+ # match > 0 spaces
/x # free-spacing regex definition mode
"1 !2 3".split(r)
#=> ["1", "3"]
or two regexes:
"1 !2 3".gsub(/!\d+/, "").scan(/\d+/)
#=> ["1", "3"]
or no regexes:
"1 !2 3".split.reject { |s| s.start_with?("!") }
#=> ["1", "3"]

Related

How to check if the first and last character of a word are the same in Ruby?

If I have a string that's a sentence, I want to check if the first and last letter of each word are the same and find which of the words have their first and last letter the same. For example:
sentence_one = "Label the bib numbers in red."
You could use a regex:
sentence_one = "Label the bib numbers in red"
sentence_one.scan(/(\b(\w)\w*(\2)\b)/i)
#=> [["Label", "L", "l"], ["bib", "b", "b"]]
\b is a word boundary, \w matches a word character, i.e. a letter, digit or underscore (you may have to adjust this). There are 3 captures: (1) the whole word, (2) the first letter and (3) the last letter. Using \2 requires the last letter to match the first.
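Since scan returns one array of captures per match, you would typically keep only the first capture to get the words themselves. The sketch below also shows one possible way to adjust \w, assuming you only want letters: swapping in \p{Alpha}. The sample sentence containing 1221 is invented for the demonstration.

```ruby
sentence = "Label 1221 the bib numbers in red"

# \w also accepts digits and underscores, so "1221" qualifies.
with_word_chars = sentence.scan(/(\b(\w)\w*(\2)\b)/i).map(&:first)

# \p{Alpha} restricts both the first character and the rest to letters.
letters_only = sentence.scan(/(\b(\p{Alpha})\p{Alpha}*(\2)\b)/i).map(&:first)

p with_word_chars #=> ["Label", "1221", "bib"]
p letters_only    #=> ["Label", "bib"]
```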
This will print out all words that start and end with the same letter (case-insensitive):
sentence_one = "Label the bib numbers in red"
words = sentence_one.split(' ')
words.each do |word|
  if word[0].downcase == word[-1].downcase
    puts word
  end
end
sentence_one.scan(/\S+/).select{|s| s[0].downcase == s[-1].downcase}
# => ["Label", "bib"]
In a comment the OP asked how one could obtain a count of words having the desired property. I assume that the desired property is that a word's first and last characters are the same, though possibly of different case. Here is one way to do that without building an intermediate array whose elements would then be counted.
r = /
\b # match a word break
(?: # begin a non-capture group
\p{Alpha} # match a letter
| # or
(\p{Alpha}) # match a letter in capture group 1
\p{Alpha}* # match zero or more letters
\1 # match the contents of capture group 1
) # end the non-capture group
\b # match a word break
/ix # case-indifferent and free-spacing regex definition modes
str = "How, now is that a brown cow?"
str.gsub(r).count
#=> 2
See String#gsub, in particular the case where there is only one argument and no block is provided.
Note
str.gsub(r).to_a
#=> ["that", "a"]
str.scan(r)
#=> [["t"], [nil]]
Sometimes it is awkward to use scan when the regular expression contains capture groups (see String#scan). Those problems often can be avoided by instead using gsub followed by to_a (or Enumerable#entries).
Just to add one more option, splitting into an array (and skipping one-letter words):
sentence_one = "Label the bib numbers in a red color"
sentence_one.split(' ').select { |w| w.size > 1 && w.end_with?(w[0].downcase) }
#=> ["Label", "bib"]
sentence_one = "Label the bib numbers in red"
puts sentence_one.split(' ').count{|word| word[0] == word[-1]} # => 1

Match exact phrase and words in regex

I'm splitting a search result string so I can use Rails Highlight to highlight the terms. In some cases, there will be exact matches and single words in the same search term and I'm trying to write regex that will do that in a single pass.
search_term = 'pizza cheese "ham and pineapple" pepperoni'
search_term.split(/\W+/)
=> ["pizza", "cheese", "ham", "and", "pineapple", "pepperoni"]
search_term.split(/(?=\")\W+/)
=> ["pizza cheese ", "ham and pineapple", "pepperoni"]
I can get ham and pineapple on its own (without the unwanted quotes), and I can easily split all the words, but is there some regex that will return an array like:
search_term.split(🤷‍♂️)
=> ["pizza", "cheese", "ham and pineapple", "pepperoni"]
Yes:
/"[^"]*?"|\w+/
https://regex101.com/r/fzHI4g/2
This is not done as a split. Instead, match either a quoted phrase or a single word; each one is a match.
£ cat pizza
pizza "a and b" pie
£ ruby -ne 'print $_.scan(/"[^"]*?"|\w+/)' pizza
["pizza", "\"a and b\"", "pie"]
£
so...search_term.scan(/regex/) seems to return the array you want.
To exclude the quotes you need:
/(?<=")\w[^"]*?(?=")|\w+/
This puts the quotes in lookarounds, which assert that the matched expression has a quote before it (lookbehind) and a quote after it (lookahead) rather than containing the quotes.
Note that because the last regex doesn't consume the quotes, it relies on the character right after a quote being a word character (not whitespace) to tell opening quotes from closing ones, so a phrase such as " a bear" is not ok. This can be solved with capture groups, but if this is an issue, as I said in the comments, I would recommend just trimming the quotes off each array element and using the regex at the top of the answer.
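The trimming approach could look roughly like this. It is a minimal sketch: the " a bear" phrase is added to the question's search term purely to exercise the whitespace case, and the extra strip is my own assumption about what you would want for such input.

```ruby
search_term = 'pizza cheese "ham and pineapple" " a bear" pepperoni'

# Quoted phrases come back with their quotes still attached.
matches = search_term.scan(/"[^"]*?"|\w+/)

# Drop the quotes, then any whitespace left at the edges.
terms = matches.map { |t| t.delete('"').strip }

p terms #=> ["pizza", "cheese", "ham and pineapple", "a bear", "pepperoni"]
```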
r = /
(?<=\") # match a double quote in a positive lookbehind
(?!\s) # next char cannot be a whitespace, negative lookahead
[^"]+ # match one or more characters other than double-quote
(?<!\s) # previous char cannot be a whitespace, negative lookbehind
(?=\") # match a double quote in a positive lookahead
| # or
\w+ # match one or more word characters
/x # free-spacing regex definition mode
str = 'pizza "ham and pineapple" mushroom pepperoni "sausage and anchovies"'
str.scan r
#=> ["pizza", "ham and pineapple", "mushroom", "pepperoni",
# "sausage and anchovies"]

Ruby regex multiple repeating captures

I'm trying to parse a subset of a webpage with regex just for fun. It was fun until I encountered the following problem. I have a paragraph like the one below:
foo: 1, 2, 3, 4 and 5.
bar: 1, 2 and 3.
What I am trying to do is, get the numbers in the first line of the paragraph starting with foo: by applying following regex:
foo:(?:\s(\d)(?:,|\sand|\.))+
This matches the above string, but it captures only the last occurrence of the capture group, which is 5.
How can I capture all the numbers in a paragraph starting with foo: up to the first occurrence of . using a single regex pattern?
In most programming languages a repeated capturing group keeps only its last match, so you can't refer to the individual repetitions. This is a valid reason to use the \G anchor. \G makes a match start where the previous match ended or, like \A, at the beginning of the string.
Here we need the first of these two behaviours:
(?:foo:|\G(?!\A))\s*(\d+)\s*(?:,|and)?
Breakdown:
(?: Start a non-capturing group
foo: Match foo:
| Or
\G(?!\A) Continue match from where previous match ends
) End of NCG
\s* Any number of whitespace characters
(\d+) Match and capture digits
\s* Any number of whitespace characters
(?:,|and)? Optional , or and
This regex begins a match when it meets foo: in the input string. It then tries to find the digits that follow, optionally followed by a comma or and (whitespace is allowed around the digits).
The \K token resets the match: it tells the engine to forget whatever has been matched so far (but keep whatever has been captured) and leaves the cursor at that position.
I used \K in the Rubular regex so that the result set would contain only the captured digits rather than the whole matched strings. However, Rubular seems to work differently and didn't need \K, so it is not required.
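In plain Ruby the pattern can be tried with String#scan, which restarts each search where the previous match ended, so \G lines up with that position. A minimal sketch using the paragraph from the question (joined with a newline, which is my assumption about the input):

```ruby
text = "foo: 1, 2, 3, 4 and 5.\nbar: 1, 2 and 3."

r = /(?:foo:|\G(?!\A))\s*(\d+)\s*(?:,|and)?/

# scan returns one array of captures per match; flatten gives the digits.
numbers = text.scan(r).flatten

p numbers #=> ["1", "2", "3", "4", "5"]
```

The chain stops at the first period because neither foo: nor \G followed by digits can match there, and \G cannot match at any later position.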
This answer uses just one regex, but admittedly does a bit of pre- and post-processing. (Please allow me a bit of fun. I do think there may be some instructional value here.)
str = "foo: 1, 2, 34, 4 and 5. and 6."
r = /
\d+ # match one or more digits
(?=[^.]+:oof\z) # match one or more characters other than a period, followed
# by ":oof" at the end of the string, in a positive lookahead
/x # free-spacing regex definition mode
str.reverse.scan(r).join(' ').reverse.split
#=> ["1", "2", "34", "4", "5"]
The steps are as follows.
s = str.reverse
#=> ".6 dna .5 dna 4 ,43 ,2 ,1 :oof"
a = s.scan r
#=> ["5", "4", "43", "2", "1"]
b = a.join(' ')
#=> "5 4 43 2 1"
c = b.reverse
#=> "1 2 34 4 5"
c.split
#=> ["1", "2", "34", "4", "5"]
An empty array is returned if there is no match.
So, why all the reversing? It's to allow me to use a positive lookahead, which, unlike a positive lookbehind, permits variable-length matches.
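For comparison, here is a minimal sketch that avoids the reversing by first cutting out the text from foo: up to the first period and then scanning that piece for digits. It uses two regexes rather than one, so it does not meet the single-pattern requirement of the question; the variable names are only illustrative.

```ruby
str = "foo: 1, 2, 34, 4 and 5. and 6."

# Everything from "foo:" up to (but not including) the first period,
# or nil when "foo:" does not occur.
segment = str[/foo:[^.]*/]

# Fall back to an empty array when there is no match.
numbers = segment ? segment.scan(/\d+/) : []

p numbers #=> ["1", "2", "34", "4", "5"]
```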

How do I match something only if a character doesn't follow a pattern?

I'm using Ruby 2.4. How do I write a regular expression that matches a series of numbers, the plus sign and then any sequence that follows, provided that sequence doesn't contain another number? For example, this would match per my rules
23+abcdef
as would this
1111111+ __++
But this would not
2+3
Neither would this
2+ L43
I tried this but was unsuccessful ...
/\d+[[:space:]]*(\+|plus).*([^\d]|$)/i.match(mystr)
r = /\A # match beginning of string
\d+ # match one or more digits
\+ # match plus sign
\D* # match zero or more characters other than a digit
\z # match end of string
/x # free-spacing regex definition mode
"23+abcdef".match?(r)
#=> true
"1111111+ __++".match?(r)
#=> true
"23 abcdef".match?(r)
#=> false
"2+3".match?(r)
#=> false
"2+ L43".match?(r)
#=> false
If at least one character that is not a digit is to follow '+', change \D* in the regex to \D+.
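The difference between the two quantifiers only shows up when nothing follows the plus sign. A small sketch (the names loose and strict and the "23+" input are mine, chosen for illustration):

```ruby
loose  = /\A\d+\+\D*\z/ # zero or more non-digits after "+"
strict = /\A\d+\+\D+\z/ # at least one non-digit after "+"

p "23+".match?(loose)        #=> true
p "23+".match?(strict)       #=> false
p "23+abcdef".match?(strict) #=> true
p "2+ L43".match?(strict)    #=> false
```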

Capitalize the first character after a dash

So I've got a string that's an improperly formatted name. Let's say, "Jean-paul Bertaud-alain".
I want to use a regex in Ruby to find the first character after every dash and make it uppercase. So, in this case, I want to apply a method that would yield: "Jean-Paul Bertaud-Alain".
Any help?
String#gsub can take a block argument, so this is as simple as:
str = "Jean-paul Bertaud-alain"
str.gsub(/-[a-z]/) {|s| s.upcase }
# => "Jean-Paul Bertaud-Alain"
Or, more succinctly:
str.gsub(/-[a-z]/, &:upcase)
Note that the regular expression /-[a-z]/ matches only ASCII letters in the a-z range, meaning it won't match e.g. à. In Ruby versions before 2.4, String#upcase left such characters unchanged anyway, because capitalization is language-dependent (e.g. i is capitalized differently in Turkish than in English). Since Ruby 2.4, upcase applies Unicode case mapping, so a Unicode-aware class such as \p{Ll} can be used in place of [a-z] if accented letters matter. Read this answer for more information: https://stackoverflow.com/a/4418681
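If accented letters need to be handled, something along these lines may work. It is a sketch under the assumption that you are on Ruby 2.4 or later (where upcase applies Unicode case mapping) and that the string is UTF-8; the sample name is made up.

```ruby
name = "Jean-élie Bertaud-alain"

# [a-z] only covers ASCII letters, so "é" is left untouched.
ascii_only = name.gsub(/-[a-z]/) { |s| s.upcase }

# \p{Ll} matches any Unicode lowercase letter.
unicode_aware = name.gsub(/-\p{Ll}/) { |s| s.upcase }

p ascii_only    #=> "Jean-élie Bertaud-Alain"
p unicode_aware #=> "Jean-Élie Bertaud-Alain"
```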
"Jean-paul Bertaud-alain".gsub(/(?<=-)\w/, &:upcase)
# => "Jean-Paul Bertaud-Alain"
I suggest you make the test more demanding by requiring that the letter to be upcased 1) is preceded by a capitalized word followed by a hyphen and 2) is followed by lowercase letters and then a word break.
r = /
\b # Match a word break
[A-Z] # Match an upper-case letter
[a-z]+ # Match >= 1 lower-case letters
\- # Match hyphen
\K # Forget everything matched so far
[a-z] # Match a lower-case letter
(?= # Begin a positive lookahead
[a-z]+ # Match >= 1 lower-case letters
\b # Match a word break
) # End positive lookahead
/x # Free-spacing regex definition mode
"Jean-paul Bertaud-alain".gsub(r) { |s| s.upcase }
#=> "Jean-Paul Bertaud-Alain"
"Jean de-paul Bertaud-alainM".gsub(r) { |s| s.upcase }
#=> "Jean de-paul Bertaud-alainM"
