How to count instances of any Unicode letter in my string - ruby

Using Ruby 2.4, how do I count the number of instances of a Unicode letter in my string? I'm trying:
2.4.0 :009 > string = "a"
=> "a"
2.4.0 :010 > string.count('\p{L}')
=> 0
but it's displaying 0, and it should be returning 1.
I want to use the above expression rather than "a-z" because "a-z" won't cover things like accented e's.

You could try using String#scan, passing your \p{L} regex, and then chain the count method:
string = "aá"
p string.scan(/\p{L}/).count
# 2
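As for why the original attempt returns 0: String#count treats its argument as a set of literal characters rather than a regular expression, so '\p{L}' only counts braces and the letters p and L. A minimal sketch of the difference, using a made-up sample string:

```ruby
# Hypothetical sample: the letters a, á (U+00E1) and b, followed by two braces.
sample = "a\u00E1b{}"

# String#count reads '\p{L}' as a character set (p, {, L, }), not as a regex,
# so only the two braces are counted here.
literal_hits = sample.count('\p{L}')

# Matching against a real regex counts the Unicode letters instead.
letter_hits = sample.scan(/\p{L}/).count

# Block form of Enumerable#count; avoids the intermediate array built by scan.
letter_hits_no_array = sample.each_char.count { |ch| ch.match?(/\p{L}/) }

puts literal_hits          # 2
puts letter_hits           # 3
puts letter_hits_no_array  # 3
```

The block form relies on String#match?, which is available from Ruby 2.4 onwards.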

This is a way that does not create a temporary array.
str = "Même temps l'année prochaine."
str.size - str.gsub(/[[:alpha:]]/, '').size
#=> 24
The POSIX bracket expression [[:alpha:]] is equivalent to \p{Alpha}, which for ordinary text matches the same characters as \p{L} (strictly speaking it is slightly broader).
Note that
str.gsub(/[[:alpha:]]/, '')
#=> "  ' ."

Related

ruby: Grab numbers only within quotes

I would like the following sub-string
1100110011110000
from
foo = "bar9-9 '11001100 11110000 A'A\n"
I have so far used the below, which yields
puts foo.split(',').map!(&:strip)[0].gsub(/\D/, '')
>> 991100110011110000
Getting rid of the 2 leading 9's is not too difficult in this scenario, but I would like a general solution which grabs numbers only within the ' ' single quotes.
You can find the quoted part first with scan and then remove non-digits:
> results = "bar9-9 '11001100 11110000 A'A\n".scan(/'[^']*'/).map{|m| m.gsub(/\D/, '')}
# => ["1100110011110000"]
> results[0]
# => "1100110011110000"
The zeros and ones within the quoted string can be extracted using String#gsub with a regular expression, as opposed to methods that convert the string to an array of strings, modify the array and convert it back to a string. Here are three ways of doing that.
str = "bar9-9 '11001100 11110000 A'A\n"
#1: Extract the substring between the first and last single quotes, then remove characters other than zero and one
def extract(str)
  # Use a range of indices: the two-argument form str[start, length] takes a length, not an end index.
  str[(str.index("'") + 1)...str.rindex("'")].gsub(/[^01]/, '')
end
extract str
#=> "1100110011110000"
#2 Use a flag to indicate when zeroes and ones are to be kept
def extract(str)
  found = false
  str.gsub(/./m) do |c|
    found = !found if c == "'"
    (found && (c =~ /[01]/)) ? c : ''
  end
end
extract str
#=> "1100110011110000"
Here the regular expression requires the m modifier (to enable multiline mode) in order to convert the newline character to an empty string. (One could alternatively write str.chomp.gsub(/./)....)
Notice that this second method works when there are multiple single-quoted substrings.
extract "bar9-9 '11001100 11110000 A'A'10x1y'\n"
#=> "1100110011110000101"
#3 Use the flip-flop operator (variant of #2)
def extract(str)
  str.gsub(/./m) do |c|
    # The three-dot flip-flop is true from an opening quote through the next closing quote.
    next '' unless (c == "'")...(c == "'")
    c =~ /[01]/ ? c : ''
  end
end
extract str
#=> "1100110011110000"
extract "bar9-9 '11001100 11110000 A'A'10x1y'\n"
#=> "1100110011110000101"
foo.slice(/'.*?'/).scan(/\d+/).join
#=> "1100110011110000"
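The scan-based answers above can be wrapped into a small helper that returns one result per quoted substring. This is only a sketch; the method name digits_in_quotes is made up for illustration and it assumes the quotes are balanced:

```ruby
# Hypothetical helper: returns the digits found inside each single-quoted substring.
def digits_in_quotes(text)
  # scan with one capture group yields [[inner1], [inner2], ...]
  quoted_parts = text.scan(/'([^']*)'/).flatten
  # String#delete with a negated set drops everything except the digits 0-9
  quoted_parts.map { |quoted| quoted.delete('^0-9') }
end

p digits_in_quotes("bar9-9 '11001100 11110000 A'A\n")
# prints ["1100110011110000"]
p digits_in_quotes("bar9-9 '11001100 11110000 A'A'10x1y'\n")
# prints ["1100110011110000", "101"]
```

Returning an array keeps the separate quoted substrings apart; call .join on the result if a single string is wanted.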

How to use Regexp.union to match a character at the beginning of my string

I'm using Ruby 2.4. I want to match an optional "a" or "b" character, followed by an arbitrary amount of white space, and then one or more numbers, but my regexes are failing to match any of these:
2.4.0 :017 > MY_TOKENS = ["a", "b"]
=> ["a", "b"]
2.4.0 :018 > str = "40"
=> "40"
2.4.0 :019 > str =~ Regexp.new("^[#{Regexp.union(MY_TOKENS)}]?[[:space:]]*\d+[^a-z^0-9]*$")
=> nil
2.4.0 :020 > str =~ Regexp.new("^#{Regexp.union(MY_TOKENS)}?[[:space:]]*\d+[^a-z^0-9]*$")
=> nil
2.4.0 :021 > str =~ Regexp.new("^#{Regexp.union(MY_TOKENS)}?[[:space:]]*\d+$")
=> nil
I'm stumped as to what I'm doing wrong.
If they are single characters, just use MY_TOKENS.join inside the character class:
MY_TOKENS = ["a", "b"]
str = "40"
first_regex = /^[#{MY_TOKENS.join}]?[[:space:]]*\d+[^a-z0-9]*$/
# /^[ab]?[[:space:]]*\d+[^a-z0-9]*$/
puts str =~ first_regex
# 0
You can also interpolate the result of Regexp.union, but this might lead to some unexpected bugs, because the flags of the outer regexp won't apply to the inner one:
second_regex = /^#{Regexp.union(MY_TOKENS)}?[[:space:]]*\d+[^a-z0-9]*$/
# /^(?-mix:a|b)?[[:space:]]*\d+[^a-z0-9]*$/
puts str =~ second_regex
# 0
The above regex looks a lot like what you did, but using // instead of Regexp.new prevents you from having to escape the backslashes.
You could use Regexp#source to avoid this behaviour:
third_regex = /^(?:#{Regexp.union(MY_TOKENS).source})?[[:space:]]*\d+[^a-z0-9]*$/
# /^(?:a|b)?[[:space:]]*\d+[^a-z0-9]*$/
puts str =~ third_regex
# 0
or simply build your regex union yourself:
fourth_regex = /^(?:#{MY_TOKENS.join('|')})?[[:space:]]*\d+[^a-z0-9]*$/
# /^(?:a|b)?[[:space:]]*\d+[^a-z0-9]*$/
puts str =~ fourth_regex
# 0
The last 3 examples should work fine if MY_TOKENS contains words instead of single characters.
first_regex, third_regex and fourth_regex should all work fine with the /i flag.
As an example:
first_regex = /^[#{MY_TOKENS.join}]?[[:space:]]*\d+[^a-z0-9]*$/i
"A 40" =~ first_regex
# 0
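The remark above that only the alternation-based forms suit multi-character tokens can be checked with a short sketch. The tokens "ab" and "cd" are made up for illustration and are not from the question:

```ruby
# Hypothetical multi-character tokens.
word_tokens = ["ab", "cd"]

# Character class: [abcd]? matches at most ONE of the characters a, b, c, d.
char_class = /^[#{word_tokens.join}]?[[:space:]]*\d+$/

# Alternation built from Regexp.union: matches the whole token "ab" or "cd".
alternation = /^(?:#{Regexp.union(word_tokens).source})?[[:space:]]*\d+$/

p("ab 40" =~ char_class)   # nil
p("ab 40" =~ alternation)  # 0
p("a 40" =~ char_class)    # 0
```

Regexp.union also escapes any regex metacharacters in the tokens, which a plain join('|') does not.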
I believe you want to match a string that may contain any of the alternatives you defined in the MY_TOKENS, then 0+ whitespaces and then 1 or more digits up to the end of the string.
Then you need to use
Regexp.new("\\A#{Regexp.union(MY_TOKENS)}?[[:space:]]*\\d+\\z").match?(s)
or
/\A#{Regexp.union(MY_TOKENS)}?[[:space:]]*\d+\z/.match?(s)
When you use Regexp.new with a double-quoted string, remember to double the backslashes so that the regex engine receives a literal backslash (e.g. "\\d" is the digit matching pattern, whereas "\d" is just the letter d). In regex literal notation, a single backslash is enough (/\d/).
Do not forget to match the start of a string with \A and end of string with \z anchors.
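Both notes can be verified in a few lines. The following is a small sketch with made-up inputs, showing what happens to "\d" inside a double-quoted string and how ^/$ differ from \A/\z:

```ruby
# In a double-quoted string "\d" is just the letter d, so this pattern is ^d+$.
from_string_wrong = Regexp.new("^\d+$")
# Doubling the backslash passes \d through to the regex engine.
from_string_right = Regexp.new("^\\d+$")

p from_string_wrong.source        # "^d+$"
p from_string_wrong.match?("40")  # false
p from_string_right.match?("40")  # true

# ^ and $ anchor at line boundaries, \A and \z at string boundaries.
multi_line = "abc\n40" # hypothetical input with a second line
p(/^\d+$/.match?(multi_line))    # true
p(/\A\d+\z/.match?(multi_line))  # false
```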
Note that [...] creates a character class that matches any char that is defined inside it: [ab] matches an a or b, [program] will match one char, either p, r, o, g, r, a or m. If you have multicharacter sequences in the MY_TOKENS, you need to remove [...] from the pattern.
To make the regex case insensitive, pass a case insensitive modifier to the pattern and make sure you use the .source property of the regex created by Regexp.union to remove its flags (thanks, Eric). Because .source returns the bare a|b, wrap it in a non-capturing group so that the alternation does not split the whole pattern:
Regexp.new("(?i)\\A(?:#{Regexp.union(MY_TOKENS).source})?[[:space:]]*\\d+\\z")
or
/\A(?:#{Regexp.union(MY_TOKENS).source})?[[:space:]]*\d+\z/i
The second form is /\A(?:a|b)?[[:space:]]*\d+\z/i, where the i flag means that case insensitive mode is on, while the multiline (dot matches line breaks) and verbose modes are off.
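One thing to watch when interpolating .source: it yields the bare alternation a|b, and alternation has the lowest precedence in a regex. A minimal sketch of the difference a non-capturing group makes (the test strings are made up):

```ruby
tokens = ["a", "b"]
source = Regexp.union(tokens).source # "a|b"

# Without a group the pattern reads as: (\Aa) OR (b?[[:space:]]*\d+\z)
ungrouped = /\A#{source}?[[:space:]]*\d+\z/i
# With a group the optional token applies as intended.
grouped = /\A(?:#{source})?[[:space:]]*\d+\z/i

p ungrouped.match?("axyz")  # true, the \Aa branch matches on its own
p grouped.match?("axyz")    # false
p grouped.match?("B 40")    # true
p grouped.match?("40")      # true
```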

How to split a string without getting an empty string inserted in the array

I'm having trouble splitting a character from a string using a regular expression, assuming there is a match.
I want to split off either an "m" or an "f" character from the first part of a string assuming the next character is one or more numbers followed by optional space characters, followed by a string from an array I have.
I tried:
2.4.0 :006 > MY_SEPARATOR_TOKENS = ["-", " to "]
=> ["-", " to "]
2.4.0 :008 > str = "M14-19"
=> "M14-19"
2.4.0 :011 > str.split(/^(m|f)\d+[[:space:]]*#{Regexp.union(MY_SEPARATOR_TOKENS)}/i)
=> ["", "M", "19"]
Notice the extraneous "" element at the beginning of my array and also notice that the last expression is just "19" whereas I would want everything else in the string ("14-19").
How do I adjust my regular expression so that only the parts of the expression that get split end up in the array?
I find match to be a bit more elegant when extracting parts of a string with regular expressions in Ruby:
string = "M14-19"
string.match(/\A(?<m>[MF])(?<digits>\d{2}(-| to )\d{2})/)[1, 2]
=> ["M", "14-19"]
# the named captures can also be read by name
extract_string = string.match(/\A(?<m>[MF])(?<digits>\d{2}(-| to )\d{2})/)
[extract_string[:m], extract_string[:digits]]
=> ["M", "14-19"]
string = 'M14 to 14'
extract_string = string.match(/\A(?<m>[MF])(?<digits>\d{2}(-| to )\d{2})/)[1, 2]
=> ["M", "14 to 14"]
TOKENS = ["-", " to "]
r = /
(?<=\A[mMfF]) # match the beginning of the string and then one
# of the 4 characters in a positive lookbehind
(?= # begin positive lookahead
\d+ # match one or more digits
[[:space:]]* # match zero or more spaces
(?:#{TOKENS.join('|')}) # match one of the tokens
) # close the positive lookahead
/x # free-spacing regex definition mode
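(?:#{TOKENS.join('|')}) is replaced by (?:-| to ). Note that in free-spacing mode the spaces around to are ignored, so the group effectively matches - or to; the preceding [[:space:]]* absorbs the space before to.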
This can of course be written in the usual way.
r = /(?<=\A[mMfF])(?=\d+[[:space:]]*(?:#{TOKENS.join('|')}))/
When splitting on r you are splitting between two characters (between a positive lookbehind and a positive lookahead) so no characters are consumed.
"M14-19".split r
#=> ["M", "14-19"]
"M14 to 19".split r
#=> ["M", "14 to 19"]
"M14 To 19".split r
#=> ["M14 To 19"]
If it is desired that ["M", "14 To 19"] be returned in the last example, change [mMfF] to [mf] and /x to /xi.
You have a bug brewing in your code. Don't get in the habit of doing this:
#{Regexp.union(MY_SEPARATOR_TOKENS)}
You're setting yourself up with a very hard to debug problem.
Here's what's happening:
regex = Regexp.union(%w(a b)) # => /a|b/
/#{regex}/ # => /(?-mix:a|b)/
/#{regex.source}/ # => /a|b/
/(?-mix:a|b)/ is an embedded sub-pattern with its set of the regex flags m, i and x which are independent of the surrounding pattern's settings.
Consider this situation:
'CAT'[/#{regex}/i] # => nil
We'd expect that the regular expression i flag would match because it's ignoring case, but the sub-expression still allows only lowercase, causing the match to fail.
Using the bare (a|b) or adding source succeeds because the inner expression gets the main expression's i:
'CAT'[/(a|b)/i] # => "A"
'CAT'[/#{regex.source}/i] # => "A"
See "How to embed regular expressions in other regular expressions in Ruby" for additional discussion of this.
The empty element will always be there if you get a match, because the captured part appears at the beginning of the string and the string between the start of the string and the match is added to the resulting array, be it an empty or non-empty string. Either shift/drop it once you get a match, or just remove all empty array elements with .reject { |c| c.empty? } (see How do I remove blank elements from an array?).
Then, 14- is eaten up (consumed) by the \d+[[:space:]]... pattern part - put it into a (?=...) lookahead that will just check for the pattern match, but won't consume the characters.
Use something like
MY_SEPARATOR_TOKENS = ["-", " to "]
s = "M14-19"
p s.split(/^(m|f)(?=\d+[[:space:]]*#{Regexp.union(MY_SEPARATOR_TOKENS)})/i).drop(1)
#=> ["M", "14-19"]
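Both effects described above, the leading empty string and the consumed digits, can be reproduced side by side. This is a sketch only; the variable names are illustrative, and .source is used so that the outer /i flag also applies to the separators:

```ruby
separators = ["-", " to "]
sep = Regexp.union(separators).source # escaped alternation of the separators

# Consumes the digits and the separator, so they vanish from the result.
consuming = /\A(m|f)\d+[[:space:]]*(?:#{sep})/i
# Only looks ahead, so the digits and separator stay in the remainder.
lookahead = /\A(m|f)(?=\d+[[:space:]]*(?:#{sep}))/i

p "M14-19".split(consuming)                   # ["", "M", "19"]
p "M14-19".split(lookahead)                   # ["", "M", "14-19"]
p "M14-19".split(lookahead).reject(&:empty?)  # ["M", "14-19"]
p "M14 To 19".split(lookahead).drop(1)        # ["M", "14 To 19"]
```

The empty first element appears in both cases because the match starts at index 0, which is why drop(1) or reject is still needed.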

Why is RegExp.escape not working in my Ruby expression?

I'm using Ruby 2.4. I have some strings that contain characters that have special meaning in regular expressions. So to eliminate any possibility of those characters being interpreted as regexp characters, I use Regexp.escape to attempt to escape them. However, I still seem unable to make the below regular expression work ...
2.4.0 :005 > tokens = ["a", "b?", "c"]
=> ["a", "b?", "c"]
2.4.0 :006 > line = "1\ta\tb?\tc\t3"
=> "1\ta\tb?\tc\t3"
2.4.0 :009 > /#{Regexp.escape(tokens.join(" ")).gsub(" ", "\\s+")}/.match(line)
=> nil
How can I properly escape the characters before substituting the space with a "\s+" expression, which I do want interpreted as a regexp character?
When Regexp.escape(tokens.join(" ")).gsub(" ", "\\s+") is executed, tokens.join(" ") yields a b? c, then the string is escaped to a\ b\?\ c (Regexp.escape also escapes spaces), and then the gsub replaces each space but leaves the backslash before it in place, resulting in a\\s+b\?\\s+c. Now, line is 1, a, b?, c and 3 separated by tabs. So each \\ now matches a literal backslash and is followed by a plain s+; the sequence no longer forms a special regex metacharacter matching whitespace.
You need to escape the tokens, and join with \s+, or join with space and later replace the space with \s+:
/#{tokens.map { |n| Regexp.escape(n) }.join("\\s+")}/.match(line)
OR
/#{tokens.map { |n| Regexp.escape(n) }.join(" ").gsub(" ", "\\s+")}/.match(line)
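To see the difference concretely, the intermediate pattern strings can be inspected before they are compiled. A short sketch using the tokens and line from the question (the variable names are illustrative):

```ruby
tokens = ["a", "b?", "c"]
line = "1\ta\tb?\tc\t3"

# Escaping the joined string also escapes the spaces, so the gsub leaves a
# stray backslash in front of every \s+.
broken_source = Regexp.escape(tokens.join(" ")).gsub(" ", "\\s+")
puts broken_source                         # a\\s+b\?\\s+c
p Regexp.new(broken_source).match(line)    # nil

# Escaping each token first and joining afterwards keeps \s+ intact.
working_source = tokens.map { |t| Regexp.escape(t) }.join('\s+')
puts working_source                        # a\s+b\?\s+c
p Regexp.new(working_source).match(line)[0]  # "a\tb?\tc"
```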

Get last character in string

I want to get the last character in a string MY WAY - 1) Get last index 2) Get character at last index, as a STRING. After that I will compare the string with another, but I won't include that part of code here. I tried the code below and I get a strange number instead. I am using ruby 1.8.7.
Why is this happening and how do I do it ?
line = "abc;"
last_index = line.length-1
puts "last index = #{last_index}"
last_char = line[last_index]
puts last_char
Output-
last index = 3
59
Ruby docs told me that array slicing works this way -
a = "hello there"
a[1] #=> "e"
But, in my code it does not.
UPDATE:
I keep getting constant up votes on this, hence the edit. Using [-1, 1] is correct, however a better looking solution would be using just [-1]. Check Oleg Pischicov's answer.
line[-1]
# => ";"
Original Answer
In ruby you can use [-1, 1] to get last char of a string. Here:
line = "abc;"
# => "abc;"
line[-1, 1]
# => ";"
teststr = "some text"
# => "some text"
teststr[-1, 1]
# => "t"
Explanation:
Strings can take a negative index, which counts backwards from the end of the String, and a length of how many characters you want (one in this example).
Using String#slice as in OP's example: (will work only on ruby 1.9 onwards as explained in Yu Hau's answer)
line.slice(line.length - 1)
# => ";"
teststr.slice(teststr.length - 1)
# => "t"
Let's go nuts!!!
teststr.split('').last
# => "t"
teststr.split(//)[-1]
# => "t"
teststr.chars.last
# => "t"
teststr.scan(/.$/)[0]
# => "t"
teststr[/.$/]
# => "t"
teststr[teststr.length-1]
# => "t"
Just use "-1" index:
a = "hello there"
a[-1] #=> "e"
It's the simplest solution.
If you are using Rails, then apply the method #last to your string, like this:
"abc".last
# => "c"
You can use a[-1, 1] to get the last character.
You get an unexpected result because the return value of String#[] changed. You are using Ruby 1.8.7 while referring to the documentation for Ruby 2.0.
Prior to Ruby 1.9, it returns an integer character code. Since Ruby 1.9, it returns the character itself.
String#[] in Ruby 1.8.7:
str[fixnum] => fixnum or nil
String#[] in Ruby 2.0:
str[index] → new_str or nil
In ruby you can use something like this:
ending = str[-n..-1] || str
This returns the last n characters, or the whole string when it is shorter than n.
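The || str fallback matters because a range that starts before the beginning of the string returns nil. A minimal sketch, with a helper name (last_chars) made up for illustration:

```ruby
# Hypothetical helper wrapping the expression above.
def last_chars(str, n)
  # str[-n..-1] is nil when n exceeds the string length, hence the fallback.
  str[-n..-1] || str
end

p last_chars("hello there", 3)  # "ere"
p last_chars("hello there", 1)  # "e"
p last_chars("hi", 5)           # "hi"
p "hi"[-5..-1]                  # nil
```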
With Rails (ActiveSupport), I would call the #last method on the string as if it were an array, mostly because it is more descriptive.
To get the last character:
"hello there".last() #=> "e"
To get the last 3 characters you can pass a number to #last.
"hello there".last(3) #=> "ere"
The slice method will do this for you.
For example:
"hello".slice(-1)
# => "o"
Your code kind of works: the 'strange number' you are seeing is the ASCII code of ;. Every character has a corresponding ASCII code (https://www.asciitable.com/). You can convert it back with puts last_char.chr, which should output ;.
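On Ruby 1.9 and later the same round trip between a character and its code can be made explicit with String#ord and Integer#chr. A small sketch, assuming a modern Ruby rather than 1.8.7:

```ruby
line = "abc;"

last_char = line[-1]   # a one-character String on Ruby 1.9+
code = last_char.ord   # the character's code point, as Ruby 1.8 returned directly

p last_char  # ";"
p code       # 59
p code.chr   # ";"
```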
