Does multibyte character interfere with end-line character within a regex? - ruby

With this regex:
regex1 = /\z/
the following strings match:
"hello" =~ regex1 # => 5
"こんにちは" =~ regex1 # => 5
but with these regexes:
regex2 = /#$/?\z/
regex3 = /\n?\z/
the results differ:
"hello" =~ regex2 # => 5
"hello" =~ regex3 # => 5
"こんにちは" =~ regex2 # => nil
"こんにちは" =~ regex3 # => nil
What is interfering? The string encoding is UTF-8, and the OS is Linux (i.e., $/ is "\n"). Are the multibyte characters interfering with $/? How?

The problem you reported is definitely a bug in the Regexp engine of Ruby 2.0.0 (RUBY_VERSION #=> "2.0.0"), and it already existed in 1.9, when the encoding allows multibyte characters, such as __ENCODING__ #=> #<Encoding:UTF-8>.
It does not depend on Linux; the same behavior can be reproduced on OS X and Windows too.
Until bug 8210 is fixed, we can help by isolating and understanding the cases in which the problem occurs.
This can also be useful for finding a workaround applicable to specific cases.
As I understand it, the problem occurs when:
the regex searches for something before the end of string \z,
and the last character of the string is multibyte,
and the search before \z uses the zero-or-one quantifier ?,
but the number of bytes covered by the zero-or-one patterns is less than the number of bytes of the last character.
The bug may be caused by confusion between the number of bytes and the number of characters that the regular expression engine actually checks.
A few examples may help:
TEST 1: the last character "は" is 3 bytes:
s = "んにちは"
Testing for zero or one ん [3 bytes] before the end of string:
s =~ /ん?\z/u #=> 4 # OK: it works, 3 == 3
Trying with ç [2 bytes]:
s =~ /ç?\z/u #=> nil # KO: BUG when 3 > 2
s =~ /x?ç?\z/u #=> 4 # OK: it works, 3 == (1+2)
Testing for zero or one \n [1 byte]:
s =~ /\n?\z/u #=> nil # KO: BUG when 3 > 1
s =~ /\n?\n?\z/u #=> nil # KO: BUG when 3 > 2
s =~ /\n?\n?\n?\z/u #=> 4 # OK: it works, 3 == (1+1+1)
From the results of TEST 1 we can assert: if the last multibyte character of the string is 3 bytes, then the 'zero or one before' test only works when we test for at least 3 bytes (not 3 characters) before \z.
TEST 2: the last character "ç" is 2 bytes:
s = "in French there is the ç"
Testing for zero or one ん [3 bytes]:
s =~ /ん?\z/u #=> 24 # OK: 2 <= 3
Testing for zero or one é [2 bytes]:
s =~ /é?\z/u #=> 24 # OK: 2 == 2
s =~ /x?é?\z/u #=> 24 # OK: 2 < (2+1)
Testing for zero or one \n [1 byte]:
s =~ /\n?\z/u #=> nil # KO: 2 > 1 (the BUG occurs)
s =~ /\n?\n?\z/u #=> 24 # OK: 2 == (1+1)
s =~ /\n?\n?\n?\z/u #=> 24 # OK: 2 < (1+1+1)
From the results of TEST 2 we can assert: if the last multibyte character of the string is 2 bytes, then the 'zero or one before' test only works when we check for at least 2 bytes (not 2 characters) before \z.
When the multibyte character is not at the end of the string, I found that it works correctly.
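The byte widths quoted in the tests can be checked directly, and for the specific pattern from the question a regex-free check avoids the engine altogether. This is only a minimal sketch, not part of the original tests: optional_newline_end_index is a hypothetical helper name, and it covers just the \n?\z case by computing the character index at which that pattern is expected to match.

```ruby
# encoding: utf-8

# Byte widths (UTF-8) of the characters used in TEST 1 and TEST 2.
["は", "ん", "ç", "é", "\n"].each do |c|
  puts "#{c.inspect}: #{c.bytesize} byte(s)"
end

# Hypothetical helper (not from the original answer): the character index
# at which /\n?\z/ is expected to match, computed without a regex.
def optional_newline_end_index(str)
  str.end_with?("\n") ? str.length - 1 : str.length
end

optional_newline_end_index("hello")       #=> 5
optional_newline_end_index("こんにちは")    #=> 5
optional_newline_end_index("hello\n")     #=> 5
```

Because String#length counts characters rather than bytes, the result does not depend on how wide the last character is.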
public gist with my test code available here

In Ruby trunk, the issue has now been accepted as a bug. Hopefully, it will be fixed.
Update: Two patches have been posted in Ruby trunk.

Related

Simple regex - ignoring certain characters

I'm trying to use the match method with an argument of a regex to select a valid phone number, by definition, any string with nine digits.
For example:
9347584987 is valid,
(456)322-3456 is valid,
(324)5688890 is valid.
But
(340)HelloWorld is NOT valid and
456748 is NOT valid.
So far, I'm able to use \d{9} to select the example string of 9 digit characters in a row, but I'm not sure how to specifically ignore any character, such as '-' or '(' or ')' in the middle of the sequence.
What kind of Regex could I use here?
Given:
nums=['9347584987','(456)322-3456','(324)5688890','(340)HelloWorld', '456748 is NOT valid']
You can split on a NON digit and rejoin to remove non digits:
> nums.map {|s| s.split(/\D/).join}
["9347584987", "4563223456", "3245688890", "340", "456748"]
Then filter on the length:
> nums.map {|s| s.split(/\D/).join}.select {|s| s.length==10}
["9347584987", "4563223456", "3245688890"]
Or, you can grab a group of numbers that look 'phony numbery' by using a regex to grab digits and common delimiters:
> nums.map {|s| s[/[\d\-()]+/]}
["9347584987", "(456)322-3456", "(324)5688890", "(340)", "456748"]
And then process that list as above.
That would delineate:
> '123 is NOT a valid area code for 456-7890'[/[\d\-()]+/]
=> "123" # no match
vs
> '123 is NOT a valid area code for 456-7890'.split(/\D/).join
=> "1234567890" # match
I suggest using one regular expression for each valid pattern rather than constructing a single regex. It would be easier to test and debug, and easier to maintain the code. If, for example, "123-456-7890" or "123-456-7890 x231" were in the future deemed valid numbers, one would need only add a single, simple regex for each to the array VALID_PATTERS below.
VALID_PATTERS = [/\A\d{10}\z/, /\A\(\d{3}\)\d{3}-\d{4}\z/, /\A\(\d{3}\)\d{7}\z/]
def valid?(str)
  VALID_PATTERS.any? { |r| str.match?(r) }
end
ph_nbrs = %w| 9347584987 (456)322-3456 (324)5688890 (340)HelloWorld 456748 |
ph_nbrs.each { |s| puts "#{s.ljust(15)} \#=> #{valid?(s)}" }
9347584987 #=> true
(456)322-3456 #=> true
(324)5688890 #=> true
(340)HelloWorld #=> false
456748 #=> false
String#match? made its debut in Ruby v2.4. There are many alternatives, including str.match(r) and str =~ r.
"9347584987" =~ /(?:\d.*){9}/ #=> 0
"(456)322-3456" =~ /(?:\d.*){9}/ #=> 1
"(324)5688890" =~ /(?:\d.*){9}/ #=> 1
"(340)HelloWorld" =~ /(?:\d.*){9}/ #=> nil
"456748" =~ /(?:\d.*){9}/ #=> nil
Pattern: (Rubular Demo)
^\(?\d{3}\)?\d{3}-?\d{4}$ # this makes the expected symbols optional
This pattern will ensure that an opening ( at the start of the string is followed by 3 digits, then a closing ).
^(\(\d{3}\)|\d{3})\d{3}-?\d{4}$
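The difference between the two patterns shows up on inputs with an unbalanced parenthesis. A small sketch to illustrate it; the constant names LOOSE and STRICT and the last two sample strings are my own, chosen only for this comparison:

```ruby
# The two patterns from above, under hypothetical names.
LOOSE  = /^\(?\d{3}\)?\d{3}-?\d{4}$/            # each symbol optional on its own
STRICT = /^(\(\d{3}\)|\d{3})\d{3}-?\d{4}$/      # parentheses only as a pair

samples = ["9347584987", "(456)322-3456", "(324)5688890",
           "(4563223456", "456)3223456"]

samples.each do |s|
  puts "#{s.ljust(15)} loose=#{!!(s =~ LOOSE)} strict=#{!!(s =~ STRICT)}"
end
# The first three samples pass both patterns; the last two, with a lone
# parenthesis, pass only the loose one.
```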
On principle, though, I agree with melpomene in advising that you remove all non-digit characters, test for 9 character length, then store/handle the phone numbers in a single/reliable/basic format.
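That strip-then-check approach might look like the sketch below. normalize_phone is a hypothetical helper name, and the length of ten is an assumption taken from the sample numbers in the question (which all contain ten digits, even though the question text says nine).

```ruby
# Hypothetical helper: strip every non-digit, then accept only ten digits.
# Returns the bare digit string, or nil when the input is not valid.
def normalize_phone(str)
  digits = str.delete("^0-9")   # "^0-9" means: delete everything except 0-9
  digits.length == 10 ? digits : nil
end

normalize_phone("(456)322-3456")    #=> "4563223456"
normalize_phone("(340)HelloWorld")  #=> nil
normalize_phone("456748")           #=> nil
```

As noted in an earlier answer, stripping every non-digit also accepts free text that merely happens to contain ten digits, so this suits input that is already known to be a phone-number field.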

Ruby Count lines in file including last line(empty)

I'm trying to count the lines of a file with Ruby, but I can't get either IO or File to count the last line.
What do I mean by last line?
Here's a screenshot of Atom editor getting that last line
Ruby returns 20 lines, I need 21 lines. Here is such file
https://copy.com/cJbiAS4wxjsc9lWI
Interesting question (although your example file is cumbersome). Your editor shows a 21st line because the 20th line ends with a newline character. Without a trailing newline character, your editor would show 20 lines.
Here's a simpler example:
a = "foo\nbar"
b = "baz\nqux\n"
A text editor would show:
# file a
1 foo
2 bar
# file b
1 baz
2 qux
3
Ruby, however, sees 2 lines in either case:
a.lines #=> ["foo\n", "bar"]
a.lines.count #=> 2
b.lines #=> ["baz\n", "qux\n"]
b.lines.count #=> 2
You could trick Ruby into recognizing the trailing newline by adding an arbitrary character:
(a + '_').lines #=> ["foo\n", "bar_"]
(a + '_').lines.count #=> 2
(b + '_').lines #=> ["baz\n", "qux\n", "_"]
(b + '_').lines.count #=> 3
Or you could use a Regexp that matches either end of line ($) or end of string (\Z):
a.scan(/$|\Z/) #=> ["", ""]
a.scan(/$|\Z/).count #=> 2
b.scan(/$|\Z/) #=> ["", "", ""]
b.scan(/$|\Z/).count #=> 3
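Since the editor's count is just the number of newline characters plus one, the same number can also be had without a Regexp. A minimal sketch, assuming "\n" line endings; editor_line_count is a hypothetical name, not an existing Ruby method:

```ruby
# Hypothetical helper: count lines the way a text editor displays them,
# i.e. one more than the number of newline characters.
def editor_line_count(text)
  text.count("\n") + 1
end

a = "foo\nbar"
b = "baz\nqux\n"

editor_line_count(a)  #=> 2
editor_line_count(b)  #=> 3
```

For a file on disk, the string could come from File.read with whatever path applies.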
Ruby's lines method doesn't count the last empty line.
As a trick, you can add an arbitrary character at the end of your stream.
Ruby's lines returns 2 lines for this example:
1 Hello
2 World
3
whereas it returns 3 lines in this case:
1 Hello
2 World
3 *

Checking string with minimum 8 digits using regex

I have regex as follows:
/^(\d|-|\(|\)|\+|\s){12,}$/
This allows digits, -, (, ), + and whitespace. But I want to ensure the string contains at least 8 digits.
Some allowed strings are as follows:
(1323 ++24)233
24243434 43
++++43435++4554345 434
It should not allow strings like:
((((((1213)))
++++232+++
Use a lookahead at the start of your regex:
/^(?=(.*\d){8,})[\d\(\)\s+-]{8,}$/
---------------
|
|->this would check for 8 or more digits
(?=(.*\d){8,}) is a zero-width lookahead that checks for zero or more characters (i.e. .*) followed by a digit (i.e. \d), repeated 8 or more times (i.e. {8,}).
(?=) is called zero-width because it doesn't consume any characters; it just checks.
To restrict it to at most 14 digits, you can do:
/^(?=([^\d]*\d){8,14}[^\d]*$)[\d\(\)\s+-]{8,}$/
try it here
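To see both patterns at work, here is a short sketch that runs them against the sample strings from the question. The constant names are hypothetical; the patterns are copied unchanged from above.

```ruby
# Hypothetical names for the two patterns above.
AT_LEAST_8   = /^(?=(.*\d){8,})[\d\(\)\s+-]{8,}$/
FROM_8_TO_14 = /^(?=([^\d]*\d){8,14}[^\d]*$)[\d\(\)\s+-]{8,}$/

samples = ["(1323 ++24)233", "24243434 43", "++++43435++4554345 434",
           "((((((1213)))", "++++232+++"]

samples.each do |s|
  digits = s.count("0-9")   # number of digit characters in the sample
  puts "#{s.inspect} digits=#{digits} " \
       "at_least_8=#{!!(s =~ AT_LEAST_8)} from_8_to_14=#{!!(s =~ FROM_8_TO_14)}"
end
```

The third sample contains 15 digits, so it passes the first pattern but not the one capped at 14.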
Here's a non regular expression solution
numbers = ["(1323 ++24)233", "24243434 43", "++++43435++4554345 434", "123 456_7"]
numbers.each do |number|
  count = 0
  number.each_char do |char|
    count += 1 if char.to_i.to_s == char
    break if count > 7
  end
  puts "#{count > 7}"
end
No need to mention ^, $, or the "or more" part of {8,}; as for {12,}, it is unclear where it comes from.
The following makes the intention transparent.
r = /
(?=(?:.*\d){8}) # First condition: Eight digits
(?!.*[^-\d()+\s]) # Second condition: Characters other than `[-\d()+\s]` should not be included.
/x
resulting in:
"(1323 ++24)233" =~ r #=> 0
"24243434 43" =~ r #=> 0
"++++43435++4554345 434" =~ r #=> 0
"((((((1213)))" =~ r #=> nil
"++++232+++" =~ r #=> nil
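One thing worth checking with a lookahead-only pattern like this: nothing pins the match to the start of the string, so the engine may succeed at a later position, after any disallowed characters. The sketch below compares the pattern as given with a variant anchored by \A. The anchored variant, the constant names and the "abc12345678" sample are my additions, not part of the answer.

```ruby
# The pattern from above, unchanged.
UNANCHORED = /
  (?=(?:.*\d){8})      # at least eight digits ahead
  (?!.*[^-\d()+\s])    # no disallowed character ahead
/x

# Hypothetical variant: the same two conditions, checked only from the start.
ANCHORED = /\A(?=(?:.*\d){8})(?!.*[^-\d()+\s])/

"(1323 ++24)233" =~ UNANCHORED  #=> 0
"(1323 ++24)233" =~ ANCHORED    #=> 0

# Letters are not in the allowed set, yet a match is found past them.
"abc12345678" =~ UNANCHORED     #=> 3
"abc12345678" =~ ANCHORED       #=> nil
```

If the whole string is meant to be validated, anchoring at \A seems the safer choice.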

Add 0 padding to number in middle of string in ruby

This may be a really simple regex, but it's one of those problems that have proven hard to google.
I have error codes coming back from a third-party system. They are supposed to be in the format:
ZZZ##
where Z is alphabetic and # is numeric. They are supposed to be 0-padded, but I'm finding that sometimes they come back as
ZZZ#
without the 0 padding.
Does anyone know how I could add the 0 padding so I can use the string as an index into a hash?
Here's my take:
def pad(str)
  number = str.scan(/\d+/).first
  str[number] = "%02d" % number.to_i
  str
end
6.times do |n|
  puts pad("ZZZ#{7 + n}")
end
# >> ZZZ07
# >> ZZZ08
# >> ZZZ09
# >> ZZZ10
# >> ZZZ11
# >> ZZZ12
Reading:
String#[]=
Kernel#sprintf and formatting flags.
fixed = str.gsub(/([a-z]{3})(\d)(?=\D|\z)/i, '\10\2')
That says:
Find three letters
…followed by a digit
…and make sure that you then see either a non-digit or the end of the string
and replace with the three letters (\1), a zero (0), and then the digit (\2)
To pad to an arbitrary length, you could:
# Pad to six digits
fixed = str.gsub(/([a-z]{3})(\d+)/i) do
  "%s%06d" % [$1, $2.to_i]
end
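Since the goal in the question is to use the code as a hash key, the padding step can be wrapped in a small helper and applied before the lookup. A minimal sketch under assumptions: normalize_code, MESSAGES and the sample codes and messages are all hypothetical, and the helper expects the whole string to be three letters followed by digits.

```ruby
# Hypothetical lookup table keyed by zero-padded codes.
MESSAGES = { "ABC07" => "disk full", "ABC12" => "timeout" }

# Hypothetical helper: left-pad the numeric part to two digits.
# Strings that do not look like a code are returned unchanged.
def normalize_code(code)
  code.sub(/\A([a-z]{3})(\d+)\z/i) { "#{$1}#{$2.rjust(2, '0')}" }
end

normalize_code("ABC7")             #=> "ABC07"
normalize_code("ABC07")            #=> "ABC07"
MESSAGES[normalize_code("ABC7")]   #=> "disk full"
MESSAGES[normalize_code("ABC12")]  #=> "timeout"
```

Using String#rjust on the captured digits keeps the replacement free of numeric conversion, so already padded codes pass through untouched.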
Here's mine:
"ZZZ7".gsub(/\d+/) { |x| "%02d" % x.to_i }
=> "ZZZ07"
There's probably a million ways to do this but here's another look.
str.gsub!(/[0-9]+/ , '0\0' ) if str.length < 5

Convert Unicode number to Natural number in ruby

Currently I am facing a problem with Unicode characters in my Rails 3 project.
In Khmer numerals (Unicode characters), the character "៤" is equal to 4.
I want to compare ៤ >= 3 but can't.
Can anyone suggest how to compare them? Maybe there is some method that could convert ៤ to 4 so that I can do the comparison.
Note
I can type ៤ by switching keyboard from Eng to Khm and type 4 as normal.
Thanks
Do the numbers behave the same way as Arabic numerals? If so, you can use this little helper method to convert a Khmer-number string to a number:
# encoding: utf-8
class String
  def to_khmer
    num_string = chars.map{ |c| %w[០ ១ ២ ៣ ៤ ៥ ៦ ៧ ៨ ៩].index(c) || c }.join
    if num_string =~ /\./
      num_string.to_f
    else
      num_string.to_i
    end
  end
end
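For plain integers, the same digit-by-digit mapping can be written with String#tr, which accepts character ranges. This is a sketch of an alternative rather than the helper above; khmer_to_i is a hypothetical name, and it relies on the ten Khmer digits being contiguous in Unicode (U+17E0 to U+17E9).

```ruby
# encoding: utf-8

# Hypothetical helper: map Khmer digits to ASCII digits, then parse.
def khmer_to_i(str)
  str.tr("០-៩", "0-9").to_i
end

khmer_to_i("៤")        #=> 4
khmer_to_i("១២")       #=> 12
khmer_to_i("៤") >= 3   #=> true
```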
Yes, you can do that
"s".ord == 115 #=> true
115.chr == "s" #=> true
4.chr.ord == 4 #=> true
