Why doesn't scan(/\w/) count the same as gsub(/\s+/, '').length? - ruby

I am trying to count the characters in a text file excluding white spaces. My thought was to use scan; however, the tutorial I am reading uses gsub. There is a difference in output between the two, and I was wondering why. Here are the two code blocks; the gsub version is the one that's giving me the correct output:
total_characters_nospaces = text.gsub(/\s+/, '').length
puts "#{total_characters_nospaces} characters excluding spaces."
And the other one:
chars = 0
totes_chars_no = text.scan(/\w/){|everything| chars += 1 }
puts chars

The opposite of \s is not \w - it is \S.
\w is equivalent to [a-zA-Z0-9_]. It does not include many other characters such as punctuation.
\S is the exact opposite of \s - it includes any character that is not whitespace.
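To see the difference concretely, here is a minimal sketch. The sample string and the helper name count_matching are made up for illustration; they are not from the question.

```ruby
# SAMPLE_TEXT is an assumed example; it deliberately contains punctuation.
SAMPLE_TEXT = "Hello, world! It's 9 a.m.\n"

# Hypothetical helper: counts the characters in text that match pattern.
def count_matching(text, pattern)
  text.scan(pattern).length
end

puts count_matching(SAMPLE_TEXT, /\w/)   # word characters only (letters, digits, underscore)
puts count_matching(SAMPLE_TEXT, /\S/)   # every non-whitespace character
puts SAMPLE_TEXT.gsub(/\s+/, '').length  # the tutorial's approach
```

The last two counts agree with each other, while the \w count is smaller because commas, apostrophes, periods and the exclamation mark are not word characters.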

Now that your question has been answered, here are a couple other ways you could do it:
s = "now is the time for all good"
s.count "^\s" # => 22
s.each_char.reduce(0) { |count, c| count + (c =~ /\S/ ? 1 : 0) } # => 22

Related

How to find same characters in two random strings? (Ruby)

I am busy working through some problems I have found on the net and I feel like this should be simple but I am really struggling.
Say you have the string 'AbcDeFg' and the next string of 'HijKgLMnn', I want to be able to find the same characters in the string so in this case it would be 'g'.
Perhaps I wasn't giving enough information - I am doing Advent of Code and I am on day 3. I just need help with the first bit which is where you are given a string of characters - you have to split the characters in half and then compare the 2 strings. You basically have to get the common character between the two. This is what I currently have:
file_data = File.read('Day_3_task1.txt')
arr = file_data.split("\n")
finals = []
arr.each do |x|
  len = x.length
  divided_by_two = len / 2
  second = x.slice!(divided_by_two..len).split('')
  first = x.split('')
  count = 0
  (0..len).each do |z|
    first.each do |y|
      if y == second[count]
        finals.push(y)
      end
    end
    count += 1
  end
end
finals = finals.uniq
Hope that helps in terms of clarity :)
Did you try converting both strings to arrays with the String#chars method and finding the intersection of those arrays?
Like this:
string_one = 'AbcDeFg'.chars
string_two = 'HijKgLMnn'.chars
string_one & string_two # => ["g"]
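Applied to the split-in-half task described in the question, the same idea might look like the sketch below. The method name common_chars_in_halves and the sample line are assumptions for illustration, not part of the original code.

```ruby
# Hypothetical helper: splits a line into two halves and returns the
# characters that appear in both halves (each listed once).
def common_chars_in_halves(line)
  half = line.length / 2
  first = line[0, half].chars    # first half as an array of characters
  second = line[half..-1].chars  # second half as an array of characters
  first & second                 # Array#& keeps unique common elements
end

p common_chars_in_halves("abcdefxyzcuv") # halves are "abcdef" and "xyzcuv"
```

Because Array#& already removes duplicates, no separate uniq call is needed for a single line.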
One way to do that is to use the method String#scan with the regular expression
rgx = /(.)(?!.*\1.*_)(?=.*_.*\1)/
I'm not advocating this approach. I merely thought some readers might find it interesting.
Suppose
str1 = 'AbcDgeFg'
str2 = 'HijKgLMnbn'
Now form the string
str = "#{str1}_#{str2}"
#=> "AbcDgeFg_HijKgLMnbn"
I've assumed the strings contain letters only, in which case they are separated in str with any character other than a letter. I've used an underscore. Naturally, if the strings could contain underscores a different separator would have to be used.
We then compute
str.scan(rgx).flatten
#=> ["b", "g"]
Array#flatten is needed because
str.scan(rgx)
#=>[["b"], ["g"]]
The regular expression can be written in free-spacing mode to make it self-documenting:
rgx =
  /
  (.)      # match any character, save to capture group 1
  (?!      # begin a negative lookahead
    .*     # match zero or more characters
    \1     # match the contents of capture group 1
    .*     # match zero or more characters
    _      # match an underscore
  )        # end the negative lookahead
  (?=      # begin a positive lookahead
    .*     # match zero or more characters
    _      # match an underscore
    .*     # match zero or more characters
    \1     # match the contents of capture group 1
  )        # end the positive lookahead
  /x       # invoke free-spacing regex definition mode
Note that if a character appears more than once in str1 and at least once in str2 the negative lookahead ensures that only the last one in str1 is matched, to avoid returning duplicates.
Alternatively, one could write
str.gsub(rgx).to_a
This uses the (fourth) form of String#gsub, which takes a single argument and no block and returns an enumerator.
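As a quick check that the scan and gsub forms agree, here is a small sketch. The constant COMMON_RGX and the method names are hypothetical wrappers around the expressions already given above.

```ruby
# Same regular expression as above, stored in a constant for reuse.
COMMON_RGX = /(.)(?!.*\1.*_)(?=.*_.*\1)/

# Hypothetical wrapper: joins the strings with an underscore and collects
# the matches via the enumerator returned by block-less String#gsub.
def common_via_gsub(str1, str2)
  "#{str1}_#{str2}".gsub(COMMON_RGX).to_a
end

# Hypothetical wrapper: the scan-based variant, flattened to plain strings.
def common_via_scan(str1, str2)
  "#{str1}_#{str2}".scan(COMMON_RGX).flatten
end

p common_via_gsub('AbcDgeFg', 'HijKgLMnbn') #=> ["b", "g"]
p common_via_scan('AbcDgeFg', 'HijKgLMnbn') #=> ["b", "g"]
```

The gsub enumerator yields the whole match rather than the capture groups, which is why no flatten is needed there.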

How to write a method that takes in a number and returns a string placing a single hyphen before and after each odd number?

The one exception is that the returned string cannot begin or end with a hyphen, and only a single hyphen is permitted before and after each odd digit. For example:
def hyphenate(number)
# code
end
hyphenate(132237847) # should return "1-3-22-3-7-84-7"
"-1-3-22-3-7-84-7-" # incorrect because there is a hyphen before and after
# each beginning and ending odd digit respectively.
"1--3-22-3--7-84-7" # Also incorrect because there is more than one
# single hyphen before and after each odd digit
I suggest matching a non-word boundary \B (which matches a position between two digits) that is followed or preceded by an odd digit:
number.to_s.gsub(/\B(?=[13579])|\B(?<=[13579])/, '-')
Since the same position can't be matched twice, you avoid the problem of consecutive hyphens.
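Wrapped in a method, this could be sketched as follows. The method name hyphenate_with_boundaries is chosen here only to distinguish it from the other versions in this thread.

```ruby
# Inserts a hyphen at every position between two digits where the digit
# on either side is odd. \B only matches between two word characters, so
# the start and end of the string never receive a hyphen.
def hyphenate_with_boundaries(number)
  number.to_s.gsub(/\B(?=[13579])|\B(?<=[13579])/, '-')
end

puts hyphenate_with_boundaries(132237847) # prints 1-3-22-3-7-84-7
puts hyphenate_with_boundaries(2468)      # prints 2468
```

This sketch assumes a non-negative integer; a leading minus sign is not handled specially.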
A simple way is to convert the number to a string, String#split the string on odd digits (using a group so that the odd digit delimiters get into the output), clean up the stray '' strings that String#split will produce, and put it back together with Array#join:
number.to_s.split(/([13579])/).reject(&:empty?).join('-')
def hyphenate(number)
  test_string = ''
  # Convert the number to a string then iterate over each character
  number.to_s.each_char do |n|
    # If the number is divisible by 2 then just add it to the string
    # else it is an odd number then add it with the hyphens
    n.to_i % 2 == 0 ? test_string += n : test_string += "-#{n}-"
  end
  # Remove the first character of the string if it is a hyphen
  test_string = test_string[1..-1] if test_string.start_with?('-')
  # Remove the last character of the string if it is a hyphen
  test_string = test_string[0..-2] if test_string.end_with?('-')
  # Return the string and replace all double hyphens with a single hyphen
  test_string.gsub('--', '-')
end
puts hyphenate(132237847)
Returns "1-3-22-3-7-84-7"
Here's another approach for taking a number and returning it in string form with its odd digits surrounded by hyphens:
def hyphenate(number)
  result = ""
  number.digits.reverse.each do |digit|
    result << (digit.odd? ? "-#{digit}-" : digit.to_s)
  end
  result.gsub("--", "-").gsub(/(^-|-$)/, "")
end
hyphenate(132237847)
# => "1-3-22-3-7-84-7"
Hope it helps!

Matching strings that contain a letter with the first character not being a number

How do I write a regular expression that has at least one letter, but the first character must not be a number? I tried this
str = "a"
str =~ /^[^\d][[:space:]]*[a-z]*/i
# => 0
str = "="
str =~ /^[^\d][[:space:]]*[a-z]*/i
# => 0
The "=" is matched even though it contains no letters. I expect "a" to match, and a string like "3abcde" should not match.
The [a-z]* and [[:space:]]* patterns can match an empty string, so they impose no requirement when validating. Also, = is not a digit, so it is matched by the negated character class [^\d], which is a consuming pattern: it requires one character other than a digit to be present in the string.
You may rely on a lookahead that will restrict the start of string position:
/\A(?!\d).*[a-z]/im
Or even a bit faster and Unicode-friendly version:
/\A(?!\d)\P{L}*\p{L}/
See the regex demo
Details:
\A - start of a string
(?!\d) - the first char cannot be a digit
\P{L}* - 0 or more (*) chars other than letters
or
.* - any 0+ chars (including line breaks if the /m modifier is used)
\p{L} - a letter
The m modifier enables the . to match line break chars in a Ruby regex.
Use [a-z] when you need to restrict the letters to those in ASCII table only. Also, \p{L} may be replaced with [[:alpha:]] and \P{L} with [^[:alpha:]].
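A small sketch to try the Unicode-friendly pattern against the cases from the question. The constant and method names are hypothetical, and String#match? assumes Ruby 2.4 or later.

```ruby
# The pattern from this answer: no leading digit, then the first letter
# may appear after any number of non-letter characters.
LETTER_NO_LEADING_DIGIT = /\A(?!\d)\P{L}*\p{L}/

# Hypothetical helper returning true or false.
def letter_without_leading_digit?(str)
  str.match?(LETTER_NO_LEADING_DIGIT)
end

p letter_without_leading_digit?("a")      #=> true
p letter_without_leading_digit?("=")      #=> false (no letter at all)
p letter_without_leading_digit?("3abcde") #=> false (starts with a digit)
p letter_without_leading_digit?("= a")    #=> true
```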
If two regular expressions were permitted you could write:
def pass_de_test?(str)
  str[0] !~ /\d/ && str =~ /[[:alpha:]]/
end
pass_de_test?("*!\n?a>") #=> 4 (truthy)
pass_de_test?("3!\n?a>") #=> false
If you want true or false returned, change the operative line to:
(str[0] !~ /\d/ && str =~ /[[:alpha:]]/) ? true : false
or
!!(str[0] !~ /\d/ && str =~ /[[:alpha:]]/)

How to find string which is started with ".."?

I was trying to find the strings that start with exactly "..", but couldn't get it to work:
["..ab","...cc","..ps","....kkls"].each do |x|
puts x if /../.match(x)
end
..ab
...cc
..ps
....kkls
=> ["..ab", "...cc", "..ps", "....kkls"]
["..ab","...cc","..ps","....kkls"].each do |x|
puts x if /(.)(.)/.match(x)
end
..ab
...cc
..ps
....kkls
=> ["..ab", "...cc", "..ps", "....kkls"]
Expected output:
["..ab","..ps"]
What you want is
/^\.\.(?!\.)/
The caret ^ at the beginning anchors the match to the start of a line (in Ruby, use \A to anchor to the start of the whole string); periods must be escaped with a backslash as \. because in regular expressions a plain period . matches any character; the (?!\.) is a negative look-ahead meaning the next character must not be a period. So the expression means, "at the beginning of a line, match two periods that are not followed by a third period."
>> /^\.\.(?!\.)/.match "..ab"
=> #<MatchData "..">
>> /^\.\.(?!\.)/.match "...cc"
=> nil
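To get the expected array rather than printed lines, one option is Enumerable#grep. The sketch below is an assumption-laden variant: it swaps ^ for \A so that only the very start of each string is considered, and the constant and method names are made up.

```ruby
# Two literal dots at the start of the string, not followed by a third dot.
EXACTLY_TWO_DOTS = /\A\.\.(?!\.)/

# Hypothetical helper: keeps only the strings that begin with exactly two dots.
def exactly_two_leading_dots(strings)
  strings.grep(EXACTLY_TWO_DOTS)
end

p exactly_two_leading_dots(["..ab", "...cc", "..ps", "....kkls"])
#=> ["..ab", "..ps"]
```

Because the look-ahead only forbids a third dot, a bare ".." with nothing after it is also accepted by this pattern.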
Try selecting on /^\.\.[^\.]/ (starts with two dots and then not a dot).
ss = ["..ab","...cc","..ps","....kkls"]
ss.select { |x| x =~ /^\.\.[^\.]/ } # => ["..ab", "..ps"]
Try using /^\.{2}\w/ as the regular expression.
A quick explanation:
^ means the start of a line (in Ruby, \A anchors to the start of the whole string). Without this, it can match dots that are found in the middle of the string.
\. translates to . -- if you use the dot on its own, it will match any non-newline character
{2} means that you're looking for two of the dots. (you could rewrite /\.{2}/ as /\.\./)
Finally, the \w matches any word character (letter, number, underscore).
A really good place to test Ruby regular expressions is http://rubular.com/ -- it lets you play with the regex and test it right in your browser.
You don't need regex for this at all, you can just extract the appropriate leading chunks using String#[] or String#slice and do simple string comparisons:
>> a = ["..ab", "...cc", "..ps", "....kkls", ".", "..", "..."]
>> a.select { |s| s[0, 2] == '..' && s[0, 3] != '...' }
=> ["..ab", "..ps", ".."]
Maybe this:
["..ab","...cc","..ps","....kkls"].each {|x| puts x if /^\.{2}\w/.match(x) }
Or if you want to make sure the . doesn't match:
["..ab","...cc","..ps","....kkls"].each {|x| puts x if /^\.{2}[^\.]/.match(x) }

How can I improve this small Ruby Regex snippet?

How can I improve this?
The purpose of this code is to be used in a method that captures a string of hashtags (#twittertype) from a form, parses the list of words, and makes sure all the words are separated out.
WORD_TEST = "123 sunset #2d2-apple,#home,#star #Babyclub, #apple_surprise #apple,cats mustard#dog , #basic_cable safety #222 #dog-D#DOG#2D "
SECOND_TEST = 'orion#Orion#oRion,Mike'
This is my problem area, the regular expressions:
_string_rgx = /([a-zA-Z0-9]+(-|_)?\w+|#?[a-zA-Z0-9]+(-|_)?\w+)/
add_pound_sign = lambda { |a| a[0].chr == '#' ? a : a='#' + a; a}
I don't know much about regular expressions, hence the need to collect the first element from each result of the scan. The scan yielded weird stuff, but the first element was always what I wanted.
t_word = WORD_TEST.scan(_string_rgx).collect {|i| i[0] }
s_word = SECOND_TEST.scan(_string_rgx).collect {|i| i[0] }
t_word.map! { |a| a = add_pound_sign.call(a); a }
s_word.map! { |a| a = add_pound_sign.call(a); a }
The results are what I want. I just want insight from the Ruby and regex gurus out there.
puts t_word.inspect
[
"#123", "#sunset", "#2d2-apple", "#home", "#star", "#Babyclub",
"#apple_surprise", "#apple", "#cats", "#mustard", "#dog",
"#basic_cable", "#safety", "#222", "#dog-D", "#DOG", "#2D"
]
puts s_word.inspect
[
"#orion", "#Orion", "#oRion", "#Mike"
]
Thanks in advance.
Let's unfold the regex:
(
[a-zA-Z0-9]+ (-|_)? \w+
| #? [a-zA-Z0-9]+ (-|_)? \w+
)
(              begin capture group
[a-zA-Z0-9]+   match one or more alphanumeric characters
(-|_)?         match a hyphen or an underscore and save. This group may fail
\w+            match one or more "word" characters (alphanumeric + underscore)
|              OR match this:
#?             match optional # character
[a-zA-Z0-9]+   match one or more alphanumeric characters
(-|_)?         match hyphen or underscore and capture. May fail.
\w+            match one or more word characters
)              end capture
I'd rather write this regex like this:
(#? [a-zA-Z0-9]+ (-|_)? \w+)
or
( #? [a-zA-Z0-9]+ (-?\w+)? )
or
( #? [a-zA-Z0-9]+ -? \w+ )
(all are reasonably equivalent)
You should note that this regex will fail on hashtags with Unicode characters, e.g. #Ü-Umlaut, #façade, etc. You are also limited to a two-character minimum length (#a fails, #ab matches) and may have only one hyphen (#a-b-c fails, or rather returns #a-b).
I would reduce your regex pattern to something like this:
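The length and hyphen limits can be checked with a short sketch. Note one deliberate change, which is an assumption of this example rather than the original pattern: the group is written as non-capturing, (?:-|_)?, so that String#scan returns plain strings instead of arrays of captures. The names HASHTAG_RGX and extract_tags are hypothetical.

```ruby
# Simplified pattern from above, with a non-capturing group (assumption)
# so that scan yields whole matches.
HASHTAG_RGX = /#?[a-zA-Z0-9]+(?:-|_)?\w+/

# Hypothetical helper: returns every substring the pattern matches.
def extract_tags(text)
  text.scan(HASHTAG_RGX)
end

p extract_tags("#a")     #=> []        (one character is too short)
p extract_tags("#ab")    #=> ["#ab"]
p extract_tags("#a-b-c") #=> ["#a-b"]  (only one hyphen is consumed)
```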
WORD_TEST = "123 sunset #2d2-apple,#home,#star #Babyclub, #apple_surprise #apple,cats mustard#dog , #basic_cable safety #222 #dog-D#DOG#2D "
foo = []
WORD_TEST.scan(/#?[-\w]+\b/) do |s|
  foo.push( s[0] != '#' ? '#' + s : s )
end
