Split string on capital letters, but not if preceded by whitespace

I have a string that looks like
"AaaBbbCcc DddEee"
I'm splitting it with
my_string.scan(/[A-Z][a-z]+/)
and the result is
["Aaa", "Bbb", "Ccc", "Ddd", "Eee"]
What I'd like to achieve is to not split the string if the capital letter is preceded by a white space, so the result would look like
["Aaa", "Bbb", "Ccc Ddd", "Eee"]

my_string.split(/(?<!\s)(?=[A-Z])/)
This matches positions that are not preceded by a whitespace (negative lookbehind - (?<!\s)) and are followed by a capital letter (positive lookahead - (?=[A-Z])).
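As a quick check of the lookaround split above, here is a minimal sketch. The constant and method names are made up for illustration; only the regex comes from the answer.

```ruby
# Split before each ASCII capital letter, unless that letter follows whitespace.
SPLIT_ON_CAPITALS = /(?<!\s)(?=[A-Z])/

# Hypothetical helper wrapping the split from the answer.
def split_on_capitals(str)
  str.split(SPLIT_ON_CAPITALS)
end

p split_on_capitals('AaaBbbCcc DddEee')
#=> ["Aaa", "Bbb", "Ccc Ddd", "Eee"]
```

Both lookarounds are zero-width, so no characters are consumed and every piece keeps its capital letter.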

If you do not need to split, or if the amount of whitespace between the desired matches can vary, you may keep your own scan approach and additionally match zero or more sequences of whitespace followed by [A-Z][a-z]+, by appending the (?:\s+[A-Z][a-z]+)* subpattern:
my_string.scan(/[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*/)
To shorten it a bit, you may build the regex dynamically:
my_string = 'AaaBbbCcc DddEee'
block = "[A-Z][a-z]+"
puts my_string.scan(/#{block}(?:\s+#{block})*/)
And here is a Unicode-friendly version of the above regex:
my_string.scan(/\p{Lu}\p{Ll}+(?:\s+\p{Lu}\p{Ll}+)*/)
where \p{Lu} matches any uppercase letter and \p{Ll} matches any lowercase letter.
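To see the difference between the ASCII and the Unicode-friendly pattern, here is a small sketch. The sample string and constant names are invented for illustration, and the source file is assumed to be UTF-8 encoded.

```ruby
# ASCII-only pattern and its Unicode-aware counterpart from the answer.
ASCII_WORDS   = /[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*/
UNICODE_WORDS = /\p{Lu}\p{Ll}+(?:\s+\p{Lu}\p{Ll}+)*/

# Illustrative input: every word starts with an accented capital letter.
SAMPLE = 'ÁbcÉfg Íjk'

p SAMPLE.scan(ASCII_WORDS)
#=> []
p SAMPLE.scan(UNICODE_WORDS)
#=> ["Ábc", "Éfg Íjk"]
```

The ASCII pattern finds nothing because none of the capitals fall in A-Z.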

Related

Match exact phrase and words in regex

I'm splitting a search result string so I can use Rails Highlight to highlight the terms. In some cases, there will be exact matches and single words in the same search term and I'm trying to write regex that will do that in a single pass.
search_term = 'pizza cheese "ham and pineapple" pepperoni'
search_term.split(/\W+/)
=> ["pizza", "cheese", "ham", "and", "pineapple", "pepperoni"]
search_term.split(/(?=\")\W+/)
=> ["pizza cheese ", "ham and pineapple", "pepperoni"]
I can get ham and pineapple on its own (without the unwanted quotes), and I can easily split all the words, but is there some regex that will return an array like:
search_term.split(🤷‍♂️)
=> ["pizza", "cheese", "ham and pineapple", "pepperoni"]

Yes:
/"[^"]*?"|\w+/
https://regex101.com/r/fzHI4g/2
Not done as a split. Just take stuff in quotes, or single words...each one is a match.
£ cat pizza
pizza "a and b" pie
£ ruby -ne 'print $_.scan(/"[^"]*?"|\w+/)' pizza
["pizza", "\"a and b\"", "pie"]
£
so...search_term.scan(/regex/) seems to return the array you want.
To exclude the quotes you need:
/(?<=")\w[^"]*?(?=")|\w+/
This puts the quotes in lookarounds, which assert that the matched expression has a quote before it (lookbehind) and a quote after it (lookahead), rather than containing the quotes.
Note that because the last regex doesn't consume the quotes, it relies on the character right after a quote to tell an opening quote from a closing one (a word character versus whitespace), so a phrase such as " a bear" is not ok. This can be solved with capture groups, but if this is an issue, I would recommend just trimming the quotes off each array element and using the regex at the top of the answer.
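The "trim the quotes afterwards" suggestion could look roughly like this. The method name search_terms is hypothetical; the regex is the one at the top of the answer.

```ruby
# Scan for quoted phrases or single words, then strip the double quotes.
# `search_terms` is an illustrative name, not part of any library.
def search_terms(query)
  query.scan(/"[^"]*?"|\w+/).map { |term| term.delete('"') }
end

p search_terms('pizza cheese "ham and pineapple" pepperoni')
#=> ["pizza", "cheese", "ham and pineapple", "pepperoni"]
p search_terms('" a bear" honey')
#=> [" a bear", "honey"]
```

Because the quotes are consumed by the match, a phrase with a space after the opening quote is still returned in one piece.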

r = /
  (?<=\")  # match a double quote in a positive lookbehind
  (?!\s)   # next char cannot be a whitespace, negative lookahead
  [^"]+    # match one or more characters other than double-quote
  (?<!\s)  # previous char cannot be a whitespace, negative lookbehind
  (?=\")   # match a double quote in a positive lookahead
  |        # or
  \w+      # match one or more word characters
/x         # free-spacing regex definition mode
str = 'pizza "ham and pineapple" mushroom pepperoni "sausage and anchovies"'
str.scan r
#=> ["pizza", "ham and pineapple", "mushroom", "pepperoni",
#    "sausage and anchovies"]

Regex: Match all hyphens or underscores not at the beginning or the end of the string

I am writing some code that needs to convert a string to camel case. However, I want to leave any _ or - at the beginning or end of the string untouched.
I have had success matching an _ character using this regex:
^(?!_)(\w+)_(\w+)(?<!_)$
when the inputs are:
pro_gamer #matched
#ignored
_proto
proto_
__proto
proto__
__proto__
#matched as nerd_godess_of, skyrim
nerd_godess_of_skyrim
I recursively apply my method to the first capture if it still looks like nerd_godess_of.
I am having trouble adding - matches to the same regex. I assumed that just adding a - to the mix like this would work:
^(?![_-])(\w+)[_-](\w+)(?<![_-])$
and it matches like this:
super-mario #matched
eslint-path #matched
eslint-global-path #NOT MATCHED.
I would like to understand why the regex fails to match the last case given that it worked correctly for the _.
The (almost) full set of test inputs can be found here

The regex
^(?![_-])(\w+)[_-](\w+)(?<![_-])$
does not match "eslint-global-path" because the anchors ^ and $ force it to match the whole line with exactly one [_-] separator between two runs of word characters. This regex reads, "Match the beginning of the line, not followed by a hyphen or underscore; then match one or more word characters (including underscores) in capture group 1, a hyphen or underscore, and one or more word characters in capture group 2. Lastly, assert that the line does not end with a hyphen or underscore."
The fact that an underscore (but not a hyphen) is a word (\w) character is what messes up the regex: \w+ can absorb the extra underscores in nerd_godess_of_skyrim, but it cannot absorb the extra hyphen in eslint-global-path. In general, rather than using \w, you might want to use \p{Alpha} or \p{Alnum} (or POSIX [[:alpha:]] or [[:alnum:]]).
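The difference between the underscore and hyphen cases can be reproduced with a short sketch. The constant and helper names are illustrative only; the regex is the one from the question.

```ruby
# The question's regex: exactly one [_-] separator between two \w+ runs.
ONE_SEPARATOR = /^(?![_-])(\w+)[_-](\w+)(?<![_-])$/

# Hypothetical helper: returns the two captures, or nil when nothing matches.
def captures_for(str)
  m = ONE_SEPARATOR.match(str)
  m && m.captures
end

p captures_for('nerd_godess_of_skyrim')
#=> ["nerd_godess_of", "skyrim"]
p captures_for('eslint-path')
#=> ["eslint", "path"]
p captures_for('eslint-global-path')
#=> nil
```

In the first case the greedy \w+ swallows the extra underscores; in the last case neither \w+ can cross a hyphen, so the single [_-] is not enough.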
Try this.
r = /
  (?<=          # begin a positive lookbehind
    [^_-]       # match a character other than an underscore or hyphen
  )             # end positive lookbehind
  (             # begin capture group 1
    (?:         # begin a non-capture group
      -+        # match one or more hyphens
      |         # or
      _+        # match one or more underscores
    )           # end non-capture group
    [^_-]       # match any character other than an underscore or hyphen
  )             # end capture group 1
/x              # free-spacing regex definition mode
'_cats_have--nine_lives--'.gsub(r) { |s| s[-1].upcase }
#=> "_catsHaveNineLives--"
This regex is conventionally written as follows.
r = /(?<=[^_-])((?:-+|_+)[^_-])/
If all the letters are lower case one could alternatively write
'_cats_have--nine_lives--'.split(/(?<=[^_-])(?:_+|-+)(?=[^_-])/).
map(&:capitalize).join
#=> "_catsHaveNineLives--"
where
'_cats_have--nine_lives--'.split(/(?<=[^_-])(?:_+|-+)(?=[^_-])/)
#=> ["_cats", "have", "nine", "lives--"]
Here (?=[^_-]) is a positive lookahead that requires the characters on which the split is made to be followed by a character other than an underscore or hyphen.
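Applied to the inputs from the question, the gsub approach could be wrapped like this. The method name camelize_inner is made up for this sketch; the regex is the conventionally written one from the answer.

```ruby
# Separator run (hyphens or underscores) plus the following character,
# only when a non-separator character precedes it.
INNER_SEPARATOR = /(?<=[^_-])((?:-+|_+)[^_-])/

# Hypothetical helper: upcase the character after each inner separator run
# and drop the run itself; leading and trailing separators are left alone.
def camelize_inner(str)
  str.gsub(INNER_SEPARATOR) { |s| s[-1].upcase }
end

p camelize_inner('eslint-global-path')     #=> "eslintGlobalPath"
p camelize_inner('nerd_godess_of_skyrim')  #=> "nerdGodessOfSkyrim"
p camelize_inner('__proto__')              #=> "__proto__"
```

No recursion is needed, since gsub replaces every inner separator in one pass.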

You can try the regex
^(?=[^-_])(\w+[-_]\w*)+(?=[^-_])\w$

Consider writing [_-] as [-_]. A hyphen placed first (or last) in a character class is literal, so it cannot be mistaken for a range operator, as in a-z.
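How the position of the hyphen changes the meaning of a character class can be checked with a small sketch. The constant and helper names are illustrative; [+-_] is an invented example of a hyphen that does form a range.

```ruby
LITERAL_LAST  = /[_-]/   # hyphen last: literal hyphen or underscore
LITERAL_FIRST = /[-_]/   # hyphen first: same two characters
RANGE         = /[+-_]/  # hyphen in the middle: range from '+' to '_'

# Hypothetical helper: does the regex match the given one-character string?
def hit?(regex, char)
  !regex.match(char).nil?
end

p hit?(LITERAL_LAST, '-')   #=> true
p hit?(LITERAL_FIRST, '-')  #=> true
p hit?(LITERAL_LAST, 'A')   #=> false
p hit?(RANGE, 'A')          #=> true, 'A' lies between '+' and '_' in ASCII
```

So [_-] and [-_] behave the same; the range reading only appears once the hyphen sits between two other characters.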

Capitalize the first character after a dash

So I've got a string that's an improperly formatted name. Let's say, "Jean-paul Bertaud-alain".
I want to use a regex in Ruby to find the first character after every dash and make it uppercase. So, in this case, I want to apply a method that would yield: "Jean-Paul Bertaud-Alain".
Any help?

String#gsub can take a block argument, so this is as simple as:
str = "Jean-paul Bertaud-alain"
str.gsub(/-[a-z]/) {|s| s.upcase }
# => "Jean-Paul Bertaud-Alain"
Or, more succinctly:
str.gsub(/-[a-z]/, &:upcase)
Note that the regular expression /-[a-z]/ will only match letters in the a-z range, meaning it won't match e.g. à. In Ruby versions before 2.4, String#upcase did not attempt to capitalize characters with diacritics anyway, because capitalization is language-dependent (e.g. i is capitalized differently in Turkish than in English). Since Ruby 2.4, String#upcase performs Unicode case mapping, so a Unicode-aware pattern such as /-\p{Ll}/ can be used there instead. Read this answer for more information: https://stackoverflow.com/a/4418681
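For names with accented letters, a Unicode-aware variant might look like the sketch below. It assumes Ruby 2.4 or later (where String#upcase handles non-ASCII letters) and a UTF-8 source file; the method name and the second sample name are made up.

```ruby
# Upcase any lowercase letter (ASCII or not) that directly follows a dash.
# Assumes Ruby >= 2.4 for Unicode-aware String#upcase.
def capitalize_after_dash(name)
  name.gsub(/-\p{Ll}/, &:upcase)
end

p capitalize_after_dash('Jean-paul Bertaud-alain')
#=> "Jean-Paul Bertaud-Alain"
p capitalize_after_dash('Marie-ève')
#=> "Marie-Ève"
```

The match includes the dash itself, which upcase leaves unchanged.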

"Jean-paul Bertaud-alain".gsub(/(?<=-)\w/, &:upcase)
# => "Jean-Paul Bertaud-Alain"

I suggest you make the test more demanding by requiring that the letter to be upcased 1) be preceded by a capitalized word followed by a hyphen and 2) be followed by lowercase letters followed by a word break.
r = /
  \b        # Match a word break
  [A-Z]     # Match an upper-case letter
  [a-z]+    # Match >= 1 lower-case letters
  \-        # Match hyphen
  \K        # Forget everything matched so far
  [a-z]     # Match a lower-case letter
  (?=       # Begin a positive lookahead
    [a-z]+  # Match >= 1 lower-case letters
    \b      # Match a word break
  )         # End positive lookahead
/x          # Free-spacing regex definition mode
"Jean-paul Bertaud-alain".gsub(r) { |s| s.upcase }
#=> "Jean-Paul Bertaud-Alain"
"Jean de-paul Bertaud-alainM".gsub(r) { |s| s.upcase }
#=> "Jean de-paul Bertaud-alainM"

What does the regular expression [\w-] mean?

I checked the documentation, and cannot find what [\w-] means. Can anyone tell me what [\w-] means in Ruby?

The square brackets [] denote a character class. A character class will match any of the things inside it.
\w is a special class called "word characters". It is shorthand for [a-zA-Z0-9_], so it will match:
a-z (all lowercase letters)
A-Z (all uppercase letters)
0-9 (all digits)
_ (an underscore)
The class you are asking about, [\w-], is a class consisting of \w and -. So it will match the above list, plus hyphens (-).
Exactly as written, [\w-], this regex would match a single character, as long as it's in the above list, or is a dash.
If you were to add a quantifier to the end, e.g. [\w-]* or [\w-]+, then it would match any of these strings:
fooBar9
foo-Bar9
foo-Bar-9
-foo-Bar---9abc__34ab12d
And it would partially match these:
foo,Bar9 # match 'foo' - the ',' stops the match
-foo-Bar---9*bc__34ab12d # match '-foo-Bar---9', the '*' stops the match
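The full and partial matches listed above can be reproduced with a short sketch. The constant and method names are illustrative only.

```ruby
# One or more word characters or hyphens.
WORD_OR_DASH = /[\w-]+/

# Hypothetical helper: the first run of word characters and hyphens in str.
def first_run(str)
  str[WORD_OR_DASH]
end

p first_run('foo-Bar-9')                 #=> "foo-Bar-9"
p first_run('foo,Bar9')                  #=> "foo"
p first_run('-foo-Bar---9*bc__34ab12d')  #=> "-foo-Bar---9"
p 'foo-Bar'[/[\w-]/]                     #=> "f" (no quantifier: one character)
```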

\w matches any word character (letter, number, underscore).
Here is what I think it is doing. Go to Rubular and try the following:
regex_1: /\w-/
string: f-oo
regex_1 matches only f- (one word character followed by a hyphen) and stops right at the -, ignoring the rest of the string, oo.
Whereas:
regex_2: /[\w-]/
string: f-oo
regex_2 matches every character of the string, including the special char -, one character at a time: f, -, o, o.
The same holds for a string like f-1oo: a digit is also a word character, so regex_2 matches each of its characters as well.
==========
I believe the whole point of [] here is to let matching continue before and after the -, since the class accepts either a word character or a hyphen. Here are some variations I tried from irb.
irb(main):004:0> "blah-blah".scan(/\w-/)
=> ["h-"]
irb(main):005:0> "blah-blah".scan(/[\w-]/)
=> ["b", "l", "a", "h", "-", "b", "l", "a", "h"]
irb(main):006:0> "blah-blah".scan(/\w-\w/)
=> ["h-b"]
irb(main):007:0> "blah-blah".scan(/\w-\w*$/)
=> ["h-blah"]
irb(main):008:0> "blah-blah".scan(/\w*-\w*$/)
=> ["blah-blah"]

Ruby Regular expression not matching properly

I am trying to create a regex to find words that contain any vowel.
So far I have tried this:
/(.*?\S[aeiou].*?[\s|\.])/i
but I have not used regex much, so it's not working properly.
For example, if I input "test is 1234 and sky fly test1234"
it should match test, is, and, test1234, but it shows
test, is, 1234 and
If I input something else, I get a different output.

Alternatively you can also do something like:
"test is 1234 and sky fly test1234".split.find_all { |a| a =~ /[aeiou]/ }
# => ["test", "is", "and", "test1234"]

You could use the below regex.
\S*[aeiou]\S*
\S* matches zero or more non-space characters.
or
\w*[aeiou]\w*

This will solve it:
\b\w*[aeiou]+\w*\b
https://www.debuggex.com/r/O-fU394iC5ErcSs7
or you can substitute \w by \S
\b\S*[aeiou]+\S*\b
https://www.debuggex.com/r/RNE6Y6q1q5yPJbe-
\b - a word boundary
\w - same as [_a-zA-Z0-9]
\S - a non-whitespace character

Try this:
\b\w*[aeiou]\w*\b
\b denotes a word boundary, so this regexp matches a word boundary, zero or more word characters, a vowel, zero or more word characters and another word boundary.
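Run against the string from the question, the \b\w*[aeiou]\w*\b pattern could be used with scan as sketched below. The method name is hypothetical, and the second sample string is invented to show how the \S variant differs.

```ruby
# Return every word that contains at least one vowel (case-insensitive).
# `words_with_vowel` is an illustrative name.
def words_with_vowel(text)
  text.scan(/\b\w*[aeiou]\w*\b/i)
end

p words_with_vowel('test is 1234 and sky fly test1234')
#=> ["test", "is", "and", "test1234"]

# The \S variant keeps adjacent punctuation; the \w variant does not.
p 'apple, sky.'.scan(/\S*[aeiou]\S*/i)  #=> ["apple,"]
p words_with_vowel('apple, sky.')       #=> ["apple"]
```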
