Regex: match word breaks that are not "=" - ruby

How does one match all \b that are not "="?
"igloo".match(...) # => `igloo`
"igloo=".match(...) # => `nil`

First, \b never matches '=' itself; it is a zero-width assertion that matches at the boundary between a word character and a non-word character such as '='. To match only when the character after the boundary is not '=', add a negative lookahead:
rx = /igloo\b(?!=)/
"igloo".match(rx)  # => #<MatchData "igloo">
"igloo=".match(rx) # => nil
That says "match a \b boundary, but only when not followed by '='".
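To see how the lookahead behaves on a few more inputs, here is a minimal sketch. The constant IGLOO_RX, the helper igloo_word? and the extra sample strings are made up for illustration, and String#match? assumes Ruby 2.4 or later.

```ruby
# Hypothetical constant and helper, for illustration only.
IGLOO_RX = /igloo\b(?!=)/

# True when "igloo" ends at a word boundary that is not followed by "=".
def igloo_word?(str)
  str.match?(IGLOO_RX)
end

igloo_word?("igloo")    # => true
igloo_word?("igloo=")   # => false  ("=" follows the boundary)
igloo_word?("igloo =")  # => true   (a space, not "=", follows the boundary)
igloo_word?("igloos")   # => false  (no boundary between "o" and "s")
```

Note that the lookahead only inspects the single character right after the boundary, so "igloo =" still matches.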

Related

How can I use regex in Ruby to split a string into an array of the words it contains?

I am trying to create a regex pattern that will split a string into an array of words based on many different patterns and conventions. The rules are as follows:
It must split the string on all dashes, spaces, underscores, and periods.
When multiple of the aforementioned characters show up together, it must only split once (so 'the--.quick' must split to ['the', 'quick'] and not ['the', '', '', 'quick'] )
It must split the string on new capital letters, while keeping that letter with its corresponding word ('theQuickBrown' splits to ['the', 'quick', 'brown'])
It must group multiple uppercase letters in a row together ('LETS_GO' must split to ['lets', 'go'], not ['l', 'e', 't', 's', 'g', 'o'])
It must use only lowercase letters in the split array.
If it is working properly, the following should be true
"theQuick--brown_fox JumpsOver___the.lazy DOG".split_words ==
["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
So far I have almost got there. The only issue is that it splits on every capital, so "DOG".split_words is ["d", "o", "g"] and not ["dog"].
I also use a combination of regex and maps/filters on the split array to get to the solution. Bonus points if you can tell me how to get rid of that and use only regex.
Here's what I have so far:
class String
  def split_words
    split(/[_,\-, ,.]|(?=[A-Z]+)/).
      map(&:downcase).
      reject(&:empty?)
  end
end
Which when called on the string from the test above returns:
["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "d", "o", "g"]
How can I update this method to meet all of the above specs?
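You can change the regex slightly so that it doesn't split before every capital, but only before a sequence of letters that starts with one or more capitals and continues with lowercase letters. This just involves putting a [a-z]+ after the [A-Z]+. Keep the map(&:downcase) from your method if you want lowercase output.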
string = "theQuick--brown_fox JumpsOver___the.lazy DOG"
regex = /[_,\-, ,.]|(?=[A-Z]+[a-z]+)/
string.split(regex).reject(&:empty?)
# => ["the", "Quick", "brown", "fox", "Jumps", "Over", "the", "lazy", "DOG"]
Alternatively, match the words instead of splitting on the separators: extract runs of two or more uppercase letters, or any letter followed by zero or more lowercase letters:
s.scan(/\p{Lu}{2,}|\p{L}\p{Ll}*/).map(&:downcase)
The regex matches:
\p{Lu}{2,} - 2 or more uppercase letters
| - or
\p{L} - any letter
\p{Ll}* - 0 or more lowercase letters.
With map(&:downcase), the items you get with .scan() are turned to lower case.
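As a rough sketch, the scan approach can be wrapped in a helper and checked against the examples from the question. The name split_words comes from the question; defining it as a plain method instead of reopening String, and the constant name WORD_RX, are choices made here for illustration.

```ruby
# Two or more uppercase letters, or any letter plus trailing lowercase letters.
WORD_RX = /\p{Lu}{2,}|\p{L}\p{Ll}*/

# Hypothetical helper: returns the lowercased words found in str.
def split_words(str)
  str.scan(WORD_RX).map(&:downcase)
end

split_words("theQuick--brown_fox JumpsOver___the.lazy DOG")
# => ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
split_words("the--.quick")  # => ["the", "quick"]
split_words("LETS_GO")      # => ["lets", "go"]
```

Because the uppercase run is matched greedily, a mixed string such as "HTTPServer" comes out as ["https", "erver"]. Whether that matters depends on your input.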
r = /
    [- _.]+        # match one or more dashes, spaces,
                   # underscores and periods
    |              # or
    (?<=\p{Ll})    # preceded by a lowercase letter (positive lookbehind)
    (?=\p{Lu})     # followed by an uppercase letter (positive lookahead)
    /x             # free-spacing regex definition mode
str = "theQuick--brown_dog, JumpsOver___the.--lazy FOX for $5"
str.split(r).map(&:downcase)
  #=> ["the", "quick", "brown", "dog,", "jumps", "over", "the", "lazy",
  #    "fox", "for", "$5"]
If the string is to be broken on spaces and all punctuation characters, replace [- _.]+ with [ [:punct:]]+. See the entry for "[[:punct:]]" in the Regexp documentation for reference.
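A minimal sketch of that [[:punct:]] variant, written without free-spacing mode. The constant and method names are made up for illustration. The sample string leaves out the "$5" because, as far as I know, whether [[:punct:]] covers symbols such as "$" has differed between Ruby versions, so test that case on the version you run.

```ruby
# Split on runs of spaces/punctuation, or between a lowercase and an
# uppercase letter (zero-width).
PUNCT_SPLIT_RX = /[ [:punct:]]+|(?<=\p{Ll})(?=\p{Lu})/

# Hypothetical helper built on the regex above.
def split_on_punct(str)
  str.split(PUNCT_SPLIT_RX).map(&:downcase)
end

split_on_punct("theQuick--brown_dog, JumpsOver___the.--lazy FOX")
# => ["the", "quick", "brown", "dog", "jumps", "over", "the", "lazy", "fox"]
```

Unlike the [- _.]+ version, the comma after "dog" is now treated as a separator and no longer sticks to the word.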

How do I write a regex that matches the beginning of the line or NOT a character?

I'm using Ruby 2.4. How do I write a regular expression that matches when the last character is a dash and that dash is preceded either by a character that is not a dash or by the beginning of the line? So this expression should match
"-"
as should
"ab-"
but this should not
"---"
I tried the below but I'm not matching anything:
2.4.0 :012 > word = "abc-"
=> "abc-"
2.4.0 :013 > word =~ /(^|\^\-)\-$/
=> nil
Here is my go at it, using a negative lookbehind so that the final dash cannot be preceded by another dash:
regex = /(?<!-)-\z/
%w(- ab- ---).map { |s| s =~ regex }
#=> [0, 2, nil]
Not 100% sure I got your requirements right, but this does seem to do the trick:
regex = /(^|(?!-).*)-$/
%w(- ab- ---).map { |s| s =~ regex }
#=> [0, 0, nil]
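One more case that may be worth testing is "a--", where the final dash follows another dash but the string does not start with one. Below is a minimal sketch with a negative lookbehind, assuming that case should be rejected as well (the question does not list it). The constant and helper names are hypothetical, and String#match? assumes Ruby 2.4 or later.

```ruby
# (?<!-) is a negative lookbehind: the final dash may not follow a dash.
# \z anchors at the very end of the string.
TRAILING_DASH_RX = /(?<!-)-\z/

# Hypothetical helper: true when str ends in exactly one trailing dash.
def single_trailing_dash?(str)
  str.match?(TRAILING_DASH_RX)
end

single_trailing_dash?("-")    # => true
single_trailing_dash?("ab-")  # => true
single_trailing_dash?("---")  # => false
single_trailing_dash?("a--")  # => false

# For comparison, the alternation-based pattern accepts "a--":
"a--" =~ /(^|(?!-).*)-$/      # => 0
```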

Scanning through a hash and return a value if true

Based on my hash, I want to match it if it's in the string:
def conv
str = "I only have one, two or maybe sixty"
hash = {:one => 1, :two => 2, :six => 6, :sixty => 60 }
str.match( Regexp.union( hash.keys.to_s ) )
end
puts conv # => <blank>
The above does not work, and this matches only "one":
str.match( Regexp.union( hash[0].to_s ) )
Edited:
Any idea how to match "one", "two" and "sixty" in the string exactly?
If my string contains "sixt", it returns 6, and based on @Cary's answer that should not happen.
You need to convert each element of hash.keys to a string, rather than converting the array hash.keys to a string, and you should use String#scan rather than String#match. You may also need to play around with the regex until it returns everything you want and nothing you don't want.
Let's first look at your example:
str = "I only have one, two or maybe sixty"
hash = {:one => 1, :two => 2, :six => 6, :sixty => 60}
We might consider constructing the regex with word breaks (\b) before and after each word we wish to match:
r0 = Regexp.union(hash.keys.map { |k| /\b#{k.to_s}\b/ })
#=> /(?-mix:\bone\b)|(?-mix:\btwo\b)|(?-mix:\bsix\b)|(?-mix:\bsixty\b)/
str.scan(r0)
#=> ["one", "two", "sixty"]
Without the word breaks, scan would return ["one", "two", "six"], as "sixty" in str would match "six". (Word breaks are zero-width. One before a string requires that the string be preceded by a non-word character or be at the beginning of the string. One after a string requires that the string be followed by a non-word character or be at the end of the string.)
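The difference is easy to check side by side. In this sketch the constant names are made up for illustration, and the word breaks are wrapped once around a single alternation instead of around each key; for this input that gives the same result as r0.

```ruby
WORDS = %w[one two six sixty]  # the hash keys from the question, as strings

PLAIN_RX   = Regexp.union(WORDS)           # /one|two|six|sixty/
BOUNDED_RX = /\b(?:#{PLAIN_RX.source})\b/  # /\b(?:one|two|six|sixty)\b/

SENTENCE = "I only have one, two or maybe sixty"

SENTENCE.scan(PLAIN_RX)    # => ["one", "two", "six"]
SENTENCE.scan(BOUNDED_RX)  # => ["one", "two", "sixty"]
```

With the word breaks, "six" is tried first at "sixty" but fails at the trailing \b, so the engine backtracks to the "sixty" alternative.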
Depending on your requirements, word breaks may not be sufficient or suitable. Suppose, for example (with hash above):
str = "I only have one, two, twenty-one or maybe sixty"
and we do not wish to match "twenty-one". However,
str.scan(r0)
#=> ["one", "two", "one", "sixty"]
One option would be to use a regex that demands that matches be preceded by whitespace or be at the beginning of the string, and be followed by whitespace or be at the end of the string:
r1 = Regexp.union(hash.keys.map { |k| /(?<=^|\s)#{k.to_s}(?=\s|$)/ })
str.scan(r1)
#=> ["sixty"]
(?<=^|\s) is a positive lookbehind; (?=\s|$) is a positive lookahead.
Well, that avoided the match of "twenty-one" (good), but we no longer matched "one" or "two" (bad) because of the comma following each of those words in the string.
Perhaps the solution here is to first remove punctuation, which allows us to then apply either of the above regexes:
str.tr('.,?!:;-','')
#=> "I only have one two twentyone or maybe sixty"
str.tr('.,?!:;-','').scan(r0)
#=> ["one", "two", "sixty"]
str.tr('.,?!:;-','').scan(r1)
#=> ["one", "two", "sixty"]
You may also want to change the / at the end of each regex to /i to make the match case-insensitive.
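If what you ultimately want is the values from the hash rather than the matched words, the scan results can be mapped back to the keys. This is only a sketch: the names NUMBERS, NUMBER_RX and number_values are hypothetical, and it assumes the word-break approach is adequate for your input.

```ruby
NUMBERS = { one: 1, two: 2, six: 6, sixty: 60 }

# One alternation wrapped in word breaks; /i makes it case-insensitive.
NUMBER_RX = /\b(?:#{Regexp.union(NUMBERS.keys.map(&:to_s)).source})\b/i

# Hypothetical helper: returns the hash value for every key found in str.
def number_values(str)
  # downcase before the lookup, since /i may match "One" or "SIXTY"
  str.scan(NUMBER_RX).map { |word| NUMBERS[word.downcase.to_sym] }
end

number_values("I only have One, two or maybe SIXTY")  # => [1, 2, 60]
number_values("I have sixt")                          # => []
```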

Why does sub replace only one character with a regex?

I would like to strip all non-digit characters from a string.
/\D/ is a non-digit character ([^0-9]):
irb(main):010:0> s = "(123) 456-7890"
=> "(123) 456-7890"
irb(main):011:0> s.sub( /\D*/, '' )
=> "123) 456-7890"
Use String#gsub or String#tr, both of which act on every occurrence:
s.gsub(/[[:punct:]]|[[:space:]]/, '')
# => "1234567890"
s.tr('^0-9', '') # even faster
# => "1234567890"
sub replaces once. gsub replaces all.
Use gsub instead:
s.gsub( /\D/, '' )
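A short side-by-side of the calls discussed here, using the string from the question. The constant name PHONE is made up for illustration, and String#delete is added as one more regex-free option.

```ruby
PHONE = "(123) 456-7890"

PHONE.sub(/\D/, "")    # => "123) 456-7890"  (first match only)
PHONE.sub(/\D*/, "")   # => "123) 456-7890"  (first run of non-digits only)
PHONE.gsub(/\D/, "")   # => "1234567890"     (every match)
PHONE.delete("^0-9")   # => "1234567890"     (no regex needed)
```

So the * in /\D*/ does not help with sub: it only lengthens the first match, which here is just the leading "(".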

Replace with uppercase characters with gsub

I have a number of simple strings: 'som', 'man', 'pal', etc.
How do I upcase the vowels, given a vowel regex or array, to get output like 'sOm', 'pAl', 'mAn'?
"som".gsub(/[aeiou]/, &:upcase)
# => "sOm"
or
"som".tr("aeiou", "AEIOU")
# => "sOm"
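Applied to the whole list, and in place if that is what the upcase! in the question was hinting at, it might look like the sketch below. VOWELS and upcase_vowels are hypothetical names; the unary plus just makes sure the string literal is not frozen.

```ruby
VOWELS = /[aeiou]/

# Non-destructive: returns new strings and leaves the originals alone.
def upcase_vowels(words)
  words.map { |w| w.gsub(VOWELS, &:upcase) }
end

upcase_vowels(%w[som man pal])  # => ["sOm", "mAn", "pAl"]

# Destructive: gsub! changes the receiver itself and returns nil
# when nothing was replaced.
word = +"pal"
word.gsub!(VOWELS, &:upcase)
word  # => "pAl"
```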
