How can I improve this small Ruby Regex snippet? - ruby

How can I improve this?
The purpose of this code is to be used in a method that captures a string of hashtags (#twittertype) from a form, parses through the list of words, and makes sure all the words are separated out.
WORD_TEST = "123 sunset #2d2-apple,#home,#star #Babyclub, #apple_surprise #apple,cats mustard#dog , #basic_cable safety #222 #dog-D#DOG#2D "
SECOND_TEST = 'orion#Orion#oRion,Mike'
This is my problem area: regexps...
_string_rgx = /([a-zA-Z0-9]+(-|_)?\w+|#?[a-zA-Z0-9]+(-|_)?\w+)/
add_pound_sign = lambda { |a| a[0].chr == '#' ? a : a='#' + a; a}
I don't know much about regular expressions, hence the need to collect the first element from each result of the scan. It yielded weird stuff, but the first element was always what I wanted.
t_word = WORD_TEST.scan(_string_rgx).collect {|i| i[0] }
s_word = SECOND_TEST.scan(_string_rgx).collect {|i| i[0] }
t_word.map! { |a| a = add_pound_sign.call(a); a }
s_word.map! { |a| a = add_pound_sign.call(a); a }
The results are what I want. I just want insight from the Ruby and regex gurus out there.
puts t_word.inspect
[
"#123", "#sunset", "#2d2-apple", "#home", "#star", "#Babyclub",
"#apple_surprise", "#apple", "#cats", "#mustard", "#dog",
"#basic_cable", "#safety", "#222", "#dog-D", "#DOG", "#2D"
]
puts s_word.inspect
[
"#orion", "#Orion", "#oRion", "#Mike"
]
Thanks in advance.

Let's unfold the regex:
(
[a-zA-Z0-9]+ (-|_)? \w+
| #? [a-zA-Z0-9]+ (-|_)? \w+
)
( begin capture group
[a-zA-Z0-9]+ match one or more alphanumeric characters
(-|_)? match a hyphen or an underscore and save. This group may fail
\w+ match one or more "word" characters (alphanumeric + underscore)
| OR match this:
#? match optional # character
[a-zA-Z0-9]+ match one or more alphanumeric characters
(-|_)? match hyphen or underscore and capture. may fail.
\w+ match one or more word characters
) end capture
I'd rather write this regex like this:
(#? [a-zA-Z0-9]+ (-|_)? \w+)
or
( #? [a-zA-Z0-9]+ (-?\w+)? )
or
( #? [a-zA-Z0-9]+ -? \w+ )
(all are reasonably equivalent)
Note that this regex will fail on hashtags with Unicode characters, e.g. #Ü-Umlaut or #façade. You are also limited to a two-character minimum length (#a fails, #ab matches), and only one hyphen is allowed (#a-b-c is not matched in full; the scan returns #a-b).
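A minimal sketch of those limitations. The variable names and the Unicode-aware variant are assumptions for illustration, not part of the original code; the first pattern is the simplified regex with a non-capturing group so that scan returns plain strings:

```ruby
# Simplified pattern; (?:...) keeps scan from returning nested arrays.
ascii_tag = /#?[a-zA-Z0-9]+(?:-|_)?\w+/

# Hypothetical Unicode-aware variant: letters/digits in any script,
# joined by single hyphens or underscores, repeated any number of times.
unicode_tag = /#?[\p{L}\p{N}]+(?:[-_][\p{L}\p{N}]+)*/

sample = "#a #ab #a-b-c #\u00DC-Umlaut" # \u00DC is "Ü"

ascii_tags   = sample.scan(ascii_tag)
unicode_tags = sample.scan(unicode_tag)

p ascii_tags   # "#a" is skipped, "#a-b-c" is cut short, "Ü" is dropped
p unicode_tags # every tag is returned whole
```

Whether underscores and hyphens should be allowed to repeat is a design choice; the variant above permits any number of single separators between runs of letters or digits.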

I would reduce your Regex pattern such as this:
WORD_TEST = "123 sunset #2d2-apple,#home,#star #Babyclub, #apple_surprise #apple,cats mustard#dog , #basic_cable safety #222 #dog-D#DOG#2D "
foo = []
WORD_TEST.scan(/#?[-\w]+\b/) do |s|
  foo.push( s[0] != '#' ? '#' + s : s )
end
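As a sketch of how this could be wrapped for reuse (the lambda name normalize_tags is hypothetical), the same scan can be applied to the second test string from the question and to a tricky fragment of the first:

```ruby
# Hypothetical helper: extract tags and make sure each one starts with '#'.
normalize_tags = lambda do |input|
  input.scan(/#?[-\w]+\b/).map { |tag| tag.start_with?('#') ? tag : '#' + tag }
end

second = normalize_tags.call('orion#Orion#oRion,Mike')
p second # => ["#orion", "#Orion", "#oRion", "#Mike"]

fragment = normalize_tags.call('cats mustard#dog , #dog-D#DOG#2D ')
p fragment # => ["#cats", "#mustard", "#dog", "#dog-D", "#DOG", "#2D"]
```

Because the pattern has no capture groups, scan yields plain strings and no extra collect step is needed.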

Related

Matching strings that contain a letter with the first character not being a number

How do I write a regular expression that has at least one letter, but the first character must not be a number? I tried this
str = "a"
str =~ /^[^\d][[:space:]]*[a-z]*/i
# => 0
str = "="
str =~ /^[^\d][[:space:]]*[a-z]*/i
# => 0
The "=" is matched even though it contains no letters. I expect the "a" to match, and a string like "3abcde" should not match.
The [a-z]* and [[:space:]]* patterns can match an empty string, so they do not really make any difference when validating is necessary. Also, = is not a digit, so it is matched by the negated character class [^\d], which is a consuming pattern: it requires a character other than a digit at that position.
You may rely on a lookahead that will restrict the start of string position:
/\A(?!\d).*[a-z]/im
Or even a bit faster and Unicode-friendly version:
/\A(?!\d)\P{L}*\p{L}/
Details:
\A - start of a string
(?!\d) - the first char cannot be a digit
\P{L}* - 0 or more (*) chars other than letters
or
.* - any 0+ chars, including line breaks if the /m modifier is used
\p{L} - a letter
The m modifier enables the . to match line break chars in a Ruby regex.
Use [a-z] when you need to restrict the letters to those in ASCII table only. Also, \p{L} may be replaced with [[:alpha:]] and \P{L} with [^[:alpha:]].
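A short sketch comparing the two lookahead patterns on a few inputs. The variable names and the sample list are assumptions for illustration:

```ruby
ascii_rx   = /\A(?!\d).*[a-z]/im        # ASCII letters only
unicode_rx = /\A(?!\d)\P{L}*\p{L}/      # any Unicode letter

samples = ["a", "=", "3abcde", "*!\n?a>", "\u00E9"] # \u00E9 is "é"

results = samples.map do |s|
  [s, ascii_rx.match?(s), unicode_rx.match?(s)]
end

results.each do |s, ascii_ok, unicode_ok|
  puts "#{s.inspect}: ascii=#{ascii_ok} unicode=#{unicode_ok}"
end
```

Only the last sample separates the two patterns: "é" is a letter for \p{L} but falls outside [a-z].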
If two regular expressions were permitted you could write:
def pass_de_test?(str)
  str[0] !~ /\d/ && str =~ /[[:alpha:]]/
end
pass_de_test?("*!\n?a>") #=> 4 (truthy)
pass_de_test?("3!\n?a>") #=> false
If you want true or false returned, change the operative line to:
(str[0] !~ /\d/ && str =~ /[[:alpha:]]/) ? true : false
or
!!(str[0] !~ /\d/ && str =~ /[[:alpha:]]/)

Replacing hyphens in words with the next letter capitalized

I have a symbol like the following. Whenever the symbol contains the "-" hyphen mark, I want to remove it and upcase the subsequent letter.
I am able to do it like so:
sym = :'new-york'
str = sym.to_s.capitalize
/-(.)/.match(str)
str = str.gsub(/-(.)/,$1.capitalize)
=> "NewYork"
This required four lines. Is there a more elegant way to create CamelCase (upper CamelCase e.g. NewYork, NewJersey, BucksCounty) from hyphened words in Ruby?
Here's one way:
sym.to_s.split('-').map(&:capitalize).join #=> "NewYork"
sym.to_s.gsub(/(-|\A)./) { $&[-1].upcase }
or
sym.to_s.gsub(/(-|\A)./) { |m| m[-1].upcase }
r = /
  ([[:alpha:]]+)  # match 1 or more letters in capture group 1
  -               # match a hyphen
  ([[:alpha:]]+)  # match 1 or more letters in capture group 2
/x                # free-spacing regex definition mode
sym = :'new-york'
sym.to_s.sub(r) { $1.capitalize + $2.capitalize }
#=> "NewYork"
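The approaches above differ once a name has more than one hyphen or mixed case. A small sketch of those differences; the lambda name upper_camel and the extra sample symbols are assumptions for illustration:

```ruby
# Hypothetical helper: split on hyphens and capitalize each part.
upper_camel = ->(sym) { sym.to_s.split('-').map(&:capitalize).join }

two_parts   = upper_camel.call(:'new-york')
three_parts = upper_camel.call(:'salt-lake-city')

# capitalize also downcases the rest of each part...
split_mixed = upper_camel.call(:'new-YORK')
# ...while the gsub form leaves the remaining letters untouched.
gsub_mixed = 'new-YORK'.gsub(/(-|\A)./) { |m| m[-1].upcase }

# The two-group sub only rewrites the first hyphenated pair.
r = /([[:alpha:]]+)-([[:alpha:]]+)/
sub_three = 'salt-lake-city'.sub(r) { $1.capitalize + $2.capitalize }

p two_parts, three_parts, split_mixed, gsub_mixed, sub_three
```

For input that may contain several hyphens, the split or gsub forms look like the safer choice.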

Regex condition on first and last characters

How can I write a regex to match a string that does not start or end with a white space character? A matching string can have any character in the middle, and importantly, a single-character string should match.
My attempt was:
/\A\S.*\S\z/
but this will not match a single character.
This is one of the cases where you should not attempt to build a regex that matches something, but rather a regex that matches the complement of something, and use the regex negatively.
re = /\A\s|\s\z/
re !~ " " # => false
re !~ "" # => true
re !~ "sss" # => true
re !~ "s ss" # => true
re !~ " s ss" # => false
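The negated test can be wrapped in a small predicate. This is only a sketch; the names edge_whitespace and no_edge_whitespace are hypothetical:

```ruby
# Matches a whitespace character at the very start or very end of the string.
edge_whitespace = /\A\s|\s\z/

# True when the string neither starts nor ends with whitespace.
no_edge_whitespace = ->(str) { str !~ edge_whitespace }

checks = ["a", "", "a b", " a", "a ", "a\n"].map { |s| [s, no_edge_whitespace.call(s)] }
checks.each { |s, ok| puts "#{s.inspect} => #{ok}" }
```

Single-character and empty strings pass without any special casing, which is what the positive pattern in the question struggled with.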
is_ok = lambda do |str|
  a, z = str.chars.first, str.chars.last
  "#{a}#{z}" =~ / |\n|\t/ ? false : true
end
#"more elegant" (yeah dude I rock)
is_ok = lambda {|str| [0, -1].map{|i| str.chars[i] }.join =~ / |\n|\t/ ? false : true}
Use this regex:
\A\S+(?:\s*\S+)*\Z
I'm assuming that strings can span multiple lines, hence the \A and \Z. Note that in Ruby, \Z also matches just before a trailing newline, so use \z if a final newline should count as trailing whitespace.
In Ruby, something like:
if subject =~ /\A\S+(?:\s*\S+)*\Z/
  match = $&
end
Explanation
The \A anchor asserts that we are at the beginning of the subject string
\S+ matches one or more non-whitespace characters (whitespace here includes tabs, newlines, etc.). Alternatively, if you want to allow newlines at the beginning and only want to exclude a space character, you can use [^ ]+ instead of \S+
(?:\s*\S+) matches any number of optional whitespace characters, followed by one or more non-space characters
The * quantifier repeats that zero or more times
The \Z anchor asserts that we are at the end of the subject string
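A small check of how the end anchor behaves in Ruby, where \Z matches at the end of the string or just before a final newline, while \z matches only at the very end. The variable names are assumptions for illustration:

```ruby
with_upper_z = /\A\S+(?:\s*\S+)*\Z/ # pattern as given above
with_lower_z = /\A\S+(?:\s*\S+)*\z/ # stricter variant

upper_newline = "abc\n" =~ with_upper_z # trailing newline is tolerated
lower_newline = "abc\n" =~ with_lower_z # trailing newline is rejected
upper_space   = "abc "  =~ with_upper_z # a trailing space is rejected either way
lower_plain   = "a b"   =~ with_lower_z # ordinary input still matches

p upper_newline, lower_newline, upper_space, lower_plain
```

Which anchor is appropriate depends on whether input such as a line read with gets, which keeps its newline, should be accepted.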
Use lookaheads, like this:
\A(?=\S).*\S\Z
This matches the start of the string and requires (1) that the first character be a non-whitespace character and (2) that the last character be a non-whitespace character.
Matches:
"a"
"a b"
"a b c d 1231 e"
Non matches:
" " (just a space)
" a" (leading space)
"b " (trailing space)
"" (empty string)
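The lists above can be checked mechanically. A minimal sketch, assuming the lookahead pattern exactly as written in this answer:

```ruby
rx = /\A(?=\S).*\S\Z/

matches     = ["a", "a b", "a b c d 1231 e"]
non_matches = [" ", " a", "b ", ""]

all_match  = matches.all?      { |s| rx.match?(s) }
none_match = non_matches.none? { |s| rx.match?(s) }

p all_match  # => true
p none_match # => true
```

The single-character case works because .* can match nothing, leaving the lone character for the final \S.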

Why doesn't scan(/\w/) count the same as gsub(/\s+/, '').length?

I am trying to count the characters in a text file excluding white spaces. My thought was to use scan; however, the tutorial I am reading uses gsub. There is a difference in output between the two, and I was wondering why. Here are the two code blocks; the gsub version is the one that's giving me the correct output:
total_characters_nospaces = text.gsub(/\s+/, '').length
puts "#{total_characters_nospaces} characters excluding spaces."
And the other one:
chars = 0
totes_chars_no = text.scan(/\w/){|everything| chars += 1 }
puts chars
The opposite of \s is not \w - it is \S.
\w is equivalent to [a-zA-Z0-9_]. It does not include many other characters such as punctuation.
\S is the exact opposite of \s - it includes any character that is not whitespace.
Now that your question has been answered, here are a couple other ways you could do it:
s = "now is the time for all good"
s.count "^\s" # => 22
s.each_char.reduce(0) { |count, c| count + (c =~ /\S/ ? 1 : 0) } # => 22
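To see the difference between \w and \S on text that contains punctuation, here is a short sketch with a made-up sample string:

```ruby
text = "Hi, there!\nBye."

without_spaces = text.gsub(/\s+/, '').length # strips spaces and newlines
word_chars     = text.scan(/\w/).length      # letters, digits, underscore only
non_space      = text.scan(/\S/).length      # everything except whitespace

p without_spaces # => 13
p word_chars     # => 10
p non_space      # => 13
```

The three missing characters in the \w count are the comma, the exclamation mark and the period.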

How to find string which is started with ".."?

I was trying to find the strings that start with exactly "..", but couldn't get it to work:
["..ab","...cc","..ps","....kkls"].each do |x|
  puts x if /../.match(x)
end
..ab
...cc
..ps
....kkls
=> ["..ab", "...cc", "..ps", "....kkls"]
["..ab","...cc","..ps","....kkls"].each do |x|
  puts x if /(.)(.)/.match(x)
end
..ab
...cc
..ps
....kkls
=> ["..ab", "...cc", "..ps", "....kkls"]
Expected output:
["..ab","..ps"]
What you want is
/^\.\.(?!\.)/
The caret ^ at the beginning anchors the match at the start of a line. In Ruby, ^ also matches right after a newline, so use \A if you need the start of the whole string; for single-line strings like these the two are equivalent. Periods must be escaped by a backslash as \. because in regular expressions a plain period . matches any character. The (?!\.) is a negative look-ahead meaning the next character is not a period. So the expression means, "at the beginning of the line, match two periods, which must not be followed by another period."
>> /^\.\.(?!\.)/.match "..ab"
=> #<MatchData "..">
>> /^\.\.(?!\.)/.match "...cc"
=> nil
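A brief sketch of the anchor choice and of filtering the whole array. The variable names and the multi-line sample are assumptions for illustration:

```ruby
line_anchored   = /^\.\.(?!\.)/  # ^ matches at the start of any line
string_anchored = /\A\.\.(?!\.)/ # \A matches only at the start of the string

multi = "x\n..ab"
line_hit   = multi =~ line_anchored   # matches on the second line
string_hit = multi =~ string_anchored # no match

names = ["..ab", "...cc", "..ps", "....kkls"]
selected = names.grep(string_anchored)

p line_hit, string_hit, selected
```

Array#grep keeps the elements the pattern matches, which gives the expected output from the question in one call.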
Try selecting on /^\.\.[^\.]/ (starts with two dots and then not a dot).
ss = ["..ab","...cc","..ps","....kkls"]
ss.select { |x| x =~ /^\.\.[^\.]/ } # => ["..ab", "..ps"]
Try using /^\.{2}\w/ as the regular expression.
A quick explanation:
^ anchors the match at the start of a line (use \A for the start of the whole string). Without this, it can match dots that are found in the middle of the string.
\. matches a literal . -- if you use the dot on its own, it will match any non-newline character
{2} means that you're looking for two of the dots. (you could rewrite /\.{2}/ as /\.\./)
Finally, the \w matches any word character (letter, number, underscore).
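The patterns suggested for this question agree on the sample data but differ on edge cases. A small sketch, with two extra inputs ("..", "..-x") added as assumptions to show where they diverge:

```ruby
lookahead = /^\.\.(?!\.)/ # two dots, not followed by a third
negated   = /^\.\.[^\.]/  # two dots, then some character that is not a dot
word_char = /^\.{2}\w/    # two dots, then a letter, digit or underscore

inputs = ["..ab", "...cc", "..", "..-x"]

by_lookahead = inputs.grep(lookahead)
by_negated   = inputs.grep(negated)
by_word_char = inputs.grep(word_char)

p by_lookahead # => ["..ab", "..", "..-x"]
p by_negated   # => ["..ab", "..-x"]
p by_word_char # => ["..ab"]
```

Only the lookahead accepts a bare "..", because the other two patterns require a third character to be present.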
A really good place to test Ruby regular expressions is http://rubular.com/ -- it lets you play with the regex and test it right in your browser.
You don't need regex for this at all, you can just extract the appropriate leading chunks using String#[] or String#slice and do simple string comparisons:
>> a = ["..ab", "...cc", "..ps", "....kkls", ".", "..", "..."]
>> a.select { |s| s[0, 2] == '..' && s[0, 3] != '...' }
=> ["..ab", "..ps", ".."]
Maybe this:
["..ab","...cc","..ps","....kkls"].each {|x| puts x if /^\.{2}\w/.match(x) }
Or if you want to make sure the . doesn't match:
["..ab","...cc","..ps","....kkls"].each {|x| puts x if /^\.{2}[^\.]/.match(x) }
