Match exact phrase and words in regex - ruby

I'm splitting a search result string so I can use Rails Highlight to highlight the terms. In some cases, there will be exact matches and single words in the same search term and I'm trying to write regex that will do that in a single pass.
search_term = 'pizza cheese "ham and pineapple" pepperoni'
search_term.split(/\W+/)
=> ["pizza", "cheese", "ham", "and", "pineapple", "pepperoni"]
search_term.split(/(?=\")\W+/)
=> ["pizza cheese ", "ham and pineapple", "pepperoni"]
I can get ham and pineapple on its own (without the unwanted quotes), and I can easily split all the words, but is there some regex that will return an array like:
search_term.split(🤷‍♂️)
=> ["pizza", "cheese", "ham and pineapple", "pepperoni"]

Yes:
/"[^"]*?"|\w+/
https://regex101.com/r/fzHI4g/2
Not done as a split. Just take stuff in quotes, or single words...each one is a match.
£ cat pizza
pizza "a and b" pie
£ ruby -ne 'print $_.scan(/"[^"]*?"|\w+/)' pizza
["pizza", "\"a and b\"", "pie"]
£
so...search_term.scan(/regex/) seems to return the array you want.
To exclude the quotes you need:
/(?<=")\w[^"]*?(?=")|\w+/
This puts the quotes in lookarounds, which assert that the matched expression has a quote before it (lookbehind) and a quote after it (lookahead) rather than containing the quotes.
Note that because the last regex doesn't consume the quotes, it relies on whitespace to tell opening quotes from closing quotes, so a phrase with a leading space such as " a bear" is not handled correctly. This can be solved with capture groups, but if this is an issue, like I said in the comments, I would recommend just trimming quotes off each array element and using the regex at the top of the answer.
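The trimming approach recommended above could be sketched roughly as follows. The method names search_terms and search_terms_with_groups are made up for illustration, and the second method is only one guess at what the capture-group variant might look like, not the answerer's own code.

```ruby
# Hypothetical helper: scan for quoted phrases or single words,
# then strip the double quotes from each match.
def search_terms(str)
  str.scan(/"[^"]*"|\w+/).map { |term| term.delete('"') }
end

# Hypothetical capture-group variant: with groups, scan yields
# [quoted, bare] pairs in which exactly one element is nil.
def search_terms_with_groups(str)
  str.scan(/"([^"]*)"|(\w+)/).map { |quoted, bare| quoted || bare }
end

search_term = 'pizza cheese "ham and pineapple" pepperoni'
search_terms(search_term)
#=> ["pizza", "cheese", "ham and pineapple", "pepperoni"]
search_terms_with_groups(search_term)
#=> ["pizza", "cheese", "ham and pineapple", "pepperoni"]
search_terms_with_groups('pizza " a bear"')
#=> ["pizza", " a bear"]
```

Because both variants consume the quotes, a phrase with leading whitespace inside the quotes is kept in one piece.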

r = /
(?<=\") # match a double quote in a positive lookbehind
(?!\s) # next char cannot be a whitespace, negative lookahead
[^"]+ # match one or more characters other than double-quote
(?<!\s) # previous char cannot be a whitespace, negative lookbehind
(?=\") # match a double quote in a positive lookahead
| # or
\w+ # match one or more word characters
/x # free-spacing regex definition mode
str = 'pizza "ham and pineapple" mushroom pepperoni "sausage and anchovies"'
str.scan r
#=> ["pizza", "ham and pineapple", "mushroom", "pepperoni",
# "sausage and anchovies"]

Related

splitting a string misses the word which is used to split it

I have a string
a="Tamilnadu is far away from Kashmir"
If I split this string using "Tamilnadu", Tamilnadu is not part of the resulting array; I find an empty string there instead. If I split the string using "away", then away is not present in the result array either. What should I do to include the word instead of getting an empty string?
Example
a="Tamilnadu is far away from Kashmir"
p a.split("Tamilnadu")
then Output is
["", " is far away from Kashmir"]
But I want
["Tamilnadu", " is far away from Kashmir"]
From docs:
If pattern is a Regexp, str is divided where the pattern matches. Whenever the pattern matches a zero-length string, str is split into individual characters. If pattern contains groups, the respective matches will be returned in the array as well.
So... to split by "Tamilnadu" and keep it in the list, make it a capture group:
"Tamilnadu is far away from Kashmir".split(/(Tamilnadu)/)
# => ["", "Tamilnadu", " is far away from Kashmir"]
or, if you want to split after "Tamilnadu", make a zero-width match after it using lookbehind:
"Tamilnadu is far away from Kashmir".split(/(?<=Tamilnadu)/)
# => ["Tamilnadu", " is far away from Kashmir"]
If you don't know where "Tamilnadu" is in the string but you want to split the string before and after it, and not have any empty strings in the resulting array, you can use String#scan:
def split_it(str, substring)
  str.scan(/\A.+(?= #{substring}\b)|\b#{substring}\b|(?<=\b#{substring} ).+/)
end
substring = "Tamilnadu"
split_it("Tamilnadu is far away from Kashmir", substring)
#=> ["Tamilnadu", "is far away from Kashmir"]
split_it("Far away is Tamilnadu from Kashmir", substring)
#=> ["Far away is", "Tamilnadu", "from Kashmir"]
split_it("Far away from Kashmir is Tamilnadu", substring)
#=> ["Far away from Kashmir is", "Tamilnadu"]
split_it("Far away is Daluth from Kashmir", substring)
#=> []
split_it("Far away is Tamilnaduland from Kashmir", substring)
#=> []
I've assumed that substring appears at most once in the string.
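One caveat: the substring is interpolated into the pattern as-is, so any regex metacharacters it contains would be interpreted rather than matched literally. A possible variant, assuming literal matching is what is wanted, escapes it first with Regexp.escape. The name split_it_escaped is hypothetical.

```ruby
# Same pattern as split_it, but the substring is escaped so that
# characters such as + or . are matched literally.
def split_it_escaped(str, substring)
  s = Regexp.escape(substring)
  str.scan(/\A.+(?= #{s}\b)|\b#{s}\b|(?<=\b#{s} ).+/)
end

split_it_escaped("Far away is Tamilnadu from Kashmir", "Tamilnadu")
#=> ["Far away is", "Tamilnadu", "from Kashmir"]
split_it_escaped("x is a+b here", "a+b")
#=> ["x is", "a+b", "here"]
```

The word-break anchors still require the substring to start and end with a word character, so this sketch does not help with substrings that begin or end with punctuation.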
The regular expression can be written in free-spacing mode to make it self-documenting:
substring = "Tamilnadu"
/
\A.+ # match the beginning of the string followed by > 0 characters
(?=\ #{substring}\b) # match the value of substring preceded by a space and
# followed by a word break, in a positive lookahead
| # or
\b#{substring}\b # match the value of substring with a word break before and after
| # or
(?<=\b#{substring}\ ) # match the value of substring preceded by a word break
# and followed by a space, in a positive lookbehind
.+ # match > 0 characters
/x # free-spacing regex definition mode
#=>
/
\A.+ # ...
(?=\ Tamilnadu\b) # ...
| # ...
\bTamilnadu\b # ...
| # ...
(?<=\bTamilnadu\ ) # ...
.+ # ...
/x
Free-spacing mode removes all spaces before the regex is parsed, including spaces that may be intended to be part of the expression. It was for that reason that I escaped the two spaces. I could alternatively put each in a character class ([ ]) or use \s, [[:space:]] or \p{Space}, though those match any whitespace character rather than just a space, which is not quite the same.
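A small sketch of how spaces behave in free-spacing mode; the variable names are made up for illustration.

```ruby
unescaped = /a b/x    # the space is dropped: behaves like /ab/
escaped   = /a\ b/x   # an escaped space is kept
in_class  = /a[ ]b/x  # a space inside a character class is kept
any_space = /a\sb/x   # \s matches a space, but also tabs, newlines, etc.

unescaped.match?("a b")   #=> false
unescaped.match?("ab")    #=> true
escaped.match?("a b")     #=> true
in_class.match?("a b")    #=> true
any_space.match?("a b")   #=> true
any_space.match?("a\tb")  #=> true
escaped.match?("a\tb")    #=> false
```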

Regex to match `99` but not `!99`

My string is: "1 !2 3". I need to match all numbers except !2. I tried /\b\d{1,5}\b/, but it still matches !2. The \b anchor works well with words, but not digits.
What is the regex to solve my problem?
You need a negative lookbehind (?<!!) and use the word boundaries around \d+ (to exclude partial matches on 2+ digit numbers):
"1 !2 3".scan(/(?<!!)\b\d+\b/)
See IDEONE demo and a regex demo here. If you really plan to match numbers consisting of 1 to 5 digits, replace + quantifier (1 or more occurrences) with your {1,5} limiting quantifier.
The (?<!!) fails the match if a digit is preceded by an exclamation mark. The word boundaries require a non-word character (or the start or end of the string) on both sides of the digit chunks matched with \d+. As a ! is a non-word character (i.e. it belongs to the [^A-Za-z0-9_] character class), it is allowed if you just use a word boundary - that is why your regex did not work. Adding the lookbehind solves the issue.
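To make the roles of the lookbehind and the word boundaries concrete, here is a short comparison. The second input string, with a two-digit number after the exclamation mark, is an added example and not from the question.

```ruby
str = "1 !2 3"

# Word boundaries alone: "!" is a non-word character, so "2" still matches.
str.scan(/\b\d{1,5}\b/)
#=> ["1", "2", "3"]

# Lookbehind plus word boundaries: "!2" is skipped.
str.scan(/(?<!!)\b\d+\b/)
#=> ["1", "3"]

# Lookbehind without word boundaries: the second digit of "!22"
# is preceded by "2", not "!", so it slips through as a partial match.
"1 !22 3".scan(/(?<!!)\d+/)
#=> ["1", "2", "3"]
"1 !22 3".scan(/(?<!!)\b\d+\b/)
#=> ["1", "3"]
```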
You could use a regex that doesn't have a lookbehind:
r = /
\s*!\d+\s* # match >= 0 spaces, an exclamation mark, > 0 digits, >= 0 spaces
| # or
\s+ # match > 0 spaces
/x # free-spacing regex definition mode
"1 !2 3".split(r)
#=> ["1", "3"]
or two regexes:
"1 !2 3".gsub(/!\d+/, "").scan(/\d+/)
#=> ["1", "3"]
or no regexes:
"1 !2 3".split.reject { |s| s.start_with?("!") }
#=> ["1", "3"]

Capitalize the first character after a dash

So I've got a string that's an improperly formatted name. Let's say, "Jean-paul Bertaud-alain".
I want to use a regex in Ruby to find the first character after every dash and make it uppercase. So, in this case, I want to apply a method that would yield: "Jean-Paul Bertaud-Alain".
Any help?
String#gsub can take a block argument, so this is as simple as:
str = "Jean-paul Bertaud-alain"
str.gsub(/-[a-z]/) {|s| s.upcase }
# => "Jean-Paul Bertaud-Alain"
Or, more succinctly:
str.gsub(/-[a-z]/, &:upcase)
Note that the regular expression /-[a-z]/ will only match letters in the a-z range, meaning it won't match e.g. à. In Ruby versions before 2.4, String#upcase did not attempt to capitalize characters with diacritics anyway, because capitalization is language-dependent (e.g. i is capitalized differently in Turkish than in English); since Ruby 2.4 it applies Unicode case mapping by default. Read this answer for more information: https://stackoverflow.com/a/4418681
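If accented letters do need to be handled, one possible sketch, assuming Ruby 2.4 or later (where String#upcase maps non-ASCII letters), is to match any lowercase letter with \p{Ll} instead of [a-z]. The name below is an invented example.

```ruby
# Assumes Ruby 2.4+. \u00E9 is "e" with an acute accent and
# \u00C9 is its capital form; escapes are used to keep the source ASCII.
name = "Jean-\u00E9lie Bertaud-alain"

# [a-z] misses the accented letter after the first dash.
name.gsub(/-[a-z]/, &:upcase) == "Jean-\u00E9lie Bertaud-Alain"
#=> true

# \p{Ll} matches any Unicode lowercase letter.
name.gsub(/-\p{Ll}/, &:upcase) == "Jean-\u00C9lie Bertaud-Alain"
#=> true
```

Language-specific rules such as the Turkish dotted i are still not covered by this sketch.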
"Jean-paul Bertaud-alain".gsub(/(?<=-)\w/, &:upcase)
# => "Jean-Paul Bertaud-Alain"
I suggest you make the test more demanding by requiring that the letter to be upcased 1) be preceded by a capitalized word followed by a hyphen and 2) be followed by lowercase letters followed by a word break.
r = /
\b # Match a word break
[A-Z] # Match an upper-case letter
[a-z]+ # Match >= 1 lower-case letters
\- # Match a hyphen
\K # Forget everything matched so far
[a-z] # Match a lower-case letter
(?= # Begin a positive lookahead
[a-z]+ # Match >= 1 lower-case letters
\b # Match a word break
) # End positive lookahead
/x # Free-spacing regex definition mode
"Jean-paul Bertaud-alain".gsub(r) { |s| s.upcase }
#=> "Jean-Paul Bertaud-Alain"
"Jean de-paul Bertaud-alainM".gsub(r) { |s| s.upcase }
#=> "Jean de-paul Bertaud-alainM"

Ruby Regular expression not matching properly

I am trying to create a regex to find words that contain any vowel.
So far I have tried this:
/(.*?\S[aeiou].*?[\s|\.])/i
but I have not used regex much, so it's not working properly.
For example, if I input "test is 1234 and sky fly test1234"
it should match test, is, and, test1234, but it shows
test, is, 1234 and
If I put in something else, the output is different.
Alternatively you can also do something like:
"test is 1234 and sky fly test1234".split.find_all { |a| a =~ /[aeiou]/ }
# => ["test", "is", "and", "test1234"]
You could use the below regex.
\S*[aeiou]\S*
\S* matches zero or more non-space characters.
or
\w*[aeiou]\w*
This will solve it:
\b\w*[aeiou]+\w*\b
https://www.debuggex.com/r/O-fU394iC5ErcSs7
or you can substitute \w by \S
\b\S*[aeiou]+\S*\b
https://www.debuggex.com/r/RNE6Y6q1q5yPJbe-
\b - a word boundary
\w - same as [_a-zA-Z0-9]
\S - a non-whitespace character
Try this:
\b\w*[aeiou]\w*\b
\b denotes a word boundary, so this regexp matches a word boundary, zero or more word characters, a vowel, zero or more word characters and another word boundary.
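As a quick check, here are the suggested patterns applied to the question's input with String#scan. The /i flag is added on the assumption that uppercase vowels should count too, and the "don't stop" string is an added example showing where \w and \S differ.

```ruby
input = "test is 1234 and sky fly test1234"

input.scan(/\b\w*[aeiou]\w*\b/i)
#=> ["test", "is", "and", "test1234"]
input.scan(/\S*[aeiou]\S*/i)
#=> ["test", "is", "and", "test1234"]

# \w stops at the apostrophe, while \S keeps the whole token.
"don't stop".scan(/\b\w*[aeiou]\w*\b/i)
#=> ["don", "stop"]
"don't stop".scan(/\S*[aeiou]\S*/i)
#=> ["don't", "stop"]
```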

Ruby regex extracting words

I'm currently struggling to come up with a regex that can split up a string into words where words are defined as a sequence of characters surrounded by whitespace, or enclosed between double quotes. I'm using String#scan
For instance, the string:
' hello "my name" is "Tom"'
should match the words:
hello
my name
is
Tom
I managed to match the words enclosed in double quotes by using:
/"([^\"]*)"/
but I can't figure out how to incorporate the surrounded by whitespace characters to get 'hello', 'is', and 'Tom' while at the same time not screw up 'my name'.
Any help with this would be appreciated!
result = ' hello "my name" is "Tom"'.split(/\s+(?=(?:[^"]*"[^"]*")*[^"]*$)/)
will work for you. It will print
=> ["", "hello", "\"my name\"", "is", "\"Tom\""]
Just ignore the empty strings.
Explanation
"
\s # Match a single character that is a “whitespace character” (spaces, tabs, and line breaks)
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
(?= # Assert that the regex below can be matched, starting at this position (positive lookahead)
(?: # Match the regular expression below
[^\"] # Match any character that is NOT a “\"”
* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
\" # Match the character “\"” literally
[^\"] # Match any character that is NOT a “\"”
* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
\" # Match the character “\"” literally
)* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
[^\"] # Match any character that is NOT a “\"”
* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
$ # Assert position at the end of a line (at the end of the string or before a line break character)
)
"
You can use reject like this to avoid empty strings
result = ' hello "my name" is "Tom"'
.split(/\s+(?=(?:[^"]*"[^"]*")*[^"]*$)/).reject {|s| s.empty?}
prints
=> ["hello", "\"my name\"", "is", "\"Tom\""]
text = ' hello "my name" is "Tom"'
text.scan(/\s*("([^"]+)"|\w+)\s*/).each {|match| puts match[1] || match[0]}
Produces:
hello
my name
is
Tom
Explanation:
0 or more spaces followed by
either
some words within double-quotes OR
a single word
followed by 0 or more spaces
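The same idea can be sketched so that it returns an array instead of printing. The method name extract_words is hypothetical, and \S+ is used for bare words on the assumption that a word is any run of non-whitespace, as the question defines it. For comparison, the standard library's Shellwords.split is shown as a different technique: it follows shell quoting rules, so it also treats single quotes and backslashes specially and raises ArgumentError on an unmatched quote.

```ruby
require 'shellwords'

# Hypothetical helper: group 1 captures the inside of a quoted phrase,
# group 2 captures a bare word; exactly one of them is nil per match.
def extract_words(text)
  text.scan(/"([^"]*)"|(\S+)/).map { |quoted, bare| quoted || bare }
end

text = ' hello "my name" is "Tom"'
extract_words(text)
#=> ["hello", "my name", "is", "Tom"]

# Standard-library alternative using shell-style word splitting.
Shellwords.split(text)
#=> ["hello", "my name", "is", "Tom"]
```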
You can try this regex:
/\b(\w+)\b/
which uses \b to find the word boundary. And this web site http://rubular.com/ is helpful.
