Regular Expression for Ruby Integer - ruby

Ruby integers are written using an optional leading sign, an optional base indicator (0 for octal, 0x for hex, or 0b for binary), followed by a string of digits in the appropriate base. Underscore characters are ignored in the digit string. The letters mentioned in the above description may be either upper or lower case, and the underscore characters can only occur strictly within the digit string.
I need to create a regular expression to check for Ruby integers in a Java string, following the specification above.

I assume the substrings that may represent integers are separated by spaces or begin or end the string. If so, I suggest you split the string on whitespace and then use the method Kernel#Integer to determine if each element of the resulting array represents an integer.
def str_to_int(str)
  str.split.each_with_object([]) do |s,a|
    val = Integer(s) rescue nil
    a << [s, val] unless val.nil?
  end
end
str_to_int "22 -22 077 0xAB 0xA_B 0b101 -0b101 cat _3 4_"
#=> [["22", 22], ["-22", -22], ["077", 63], ["0xAB", 171],
# ["0xA_B", 171], ["0b101", 5], ["-0b101", -5]]
Integer raises an ArgumentError exception if the string cannot be converted to an integer. I've dealt with that with an in-line rescue that returns nil, but you may wish to write it so that only that exception is rescued. It may be prudent to remove punctuation from the string before executing the above method.
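As a minimal sketch of the narrower rescue mentioned above: the method name parse_integers is hypothetical, and the block form is used so that only ArgumentError (what Integer raises for a malformed string) is swallowed.

```ruby
# Hypothetical variant of str_to_int that rescues only ArgumentError,
# so unrelated errors are not silently hidden.
def parse_integers(str)
  str.split.each_with_object([]) do |token, pairs|
    begin
      pairs << [token, Integer(token)]
    rescue ArgumentError
      # token is not a valid integer literal; skip it
    end
  end
end

parse_integers("22 0xA_B cat _3 4_")
#=> [["22", 22], ["0xA_B", 171]]
```

Unlike the in-line `rescue nil`, this leaves any other exception visible to the caller.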

This regex captures positive or negative numbers in denary, binary, octal and hexadecimal form, including any underscores:
# hexadecimal binary octal denary
-?0x[0-9a-fA-F][0-9a-fA-F_]*[0-9a-fA-F]|-?0x[0-9a-fA-F]|-?0b[01][01_]*[01]|-?0b[01]|-?0[0-7][0-7_]?[0-7]?|-?0[0-7]|-?[1-9][0-9_]*[0-9]|-?[0-9]
You should test the regex thoroughly to make sure it works as required but it does seem to work on a few relevant samples I tried (see this on Rubular where I've used () captures so you can see the matches more easily but it is essentially the same regex).
Here is an example of the regex in action using String#scan:
str = "-0x88339_43 wor0ds 8_8_ 0b1001 01words0x334 _9 0b1 0x4 0_ 0x_ 0b_1 0b00_1"
reg = /-?0x[0-9a-fA-F][0-9a-fA-F_]*[0-9a-fA-F]|-?0x[0-9a-fA-F]|-?0b[01][01_]*[01]|-?0b[01]|-?0[0-7][0-7_]?[0-7]?|-?0[0-7]|-?[1-9][0-9_]*[0-9]|-?[0-9]/
#regex matches
str.scan reg
#=>["-0x88339_43", "0", "8_8", "0b1001", "01", "0x334", "9", "0b1", "0x4", "0", "0", "0", "1", "0b00_1"]
Like @CarySwoveland, I'm assuming your string has spaces. Without spaces you will still get a result, though it may not be what you want; at least it's a start.
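If the goal is to validate a whole token rather than scan for fragments, an anchored pattern may be easier to reason about. The following is only a sketch: the constant name RUBY_INTEGER is hypothetical, and it covers just the forms described in the question (optional sign; 0, 0x or 0b prefix; underscores strictly between digits), not Ruby's 0o or 0d prefixes.

```ruby
# Hypothetical pattern for one whole token, written in extended (/x) mode.
# /i lets the x, b and hex letters be upper or lower case.
RUBY_INTEGER = /
  \A[+-]?
  (?: 0x \h+   (?:_\h+)*       # hexadecimal
    | 0b [01]+ (?:_[01]+)*     # binary
    | 0 [0-7]* (?:_[0-7]+)*    # octal (and plain 0)
    | [1-9] \d* (?:_\d+)*      # decimal
  )\z
/xi

tokens = "22 -22 077 0xAB 0xA_B 0b101 cat _3 4_ 0x_".split
valid  = tokens.select { |t| RUBY_INTEGER.match?(t) }
#=> ["22", "-22", "077", "0xAB", "0xA_B", "0b101"]
```

Because the pattern is anchored with \A and \z, a token either matches completely or not at all, which avoids partial hits such as "8_8" being pulled out of "8_8_".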

Related

Ruby .to_i does not return the complete integer as expected

My ruby command is,
"980,323,344.00".to_i
Why does it return 980 instead of 980323344?
You can achieve it by doing this:
"980,323,344.00".delete(',').to_i
The reason your call to to_i does not return what you expect is explained in the documentation, which says that the method:
Returns the result of interpreting leading characters in str as an integer base base (between 2 and 36). Extraneous characters past the end of a valid number are ignored.
In your case the first extraneous character is the comma right after 980, which is why you see 980 being returned.
In Ruby, calling to_i on a string parses from the beginning of the string and stops at the first character that cannot be part of an integer.
number_string = '980,323,344.00'
number_string.delete(',').to_i
#=> 980323344
"123abc".to_i
#=> 123
If you want to add underscores to make longer numbers more readable, they can be used where the conventional commas would be in written numbers.
"980_323_344.00".to_i
#=> 980323344
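For contrast with the lenient behaviour of to_i, here is a small sketch of a strict conversion. The helper name strict_to_i is hypothetical; it relies on Kernel#Integer, which raises ArgumentError instead of silently ignoring trailing characters.

```ruby
# Hypothetical helper: remove thousands separators, then convert strictly.
# Returns nil when the remaining text is not a plain base-10 integer.
def strict_to_i(text)
  Integer(text.delete(","), 10)
rescue ArgumentError
  nil
end

"12abc".to_i                   #=> 12  (lenient: stops at "a")
strict_to_i("12abc")           #=> nil (strict: rejected)
strict_to_i("980,323,344")     #=> 980323344
strict_to_i("980,323,344.00")  #=> nil (the ".00" is not part of an integer)
```

Whether rejecting the trailing ".00" is desirable depends on the input you expect; to_i remains the simpler choice when truncation is acceptable.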
The documentation for to_i might be a bit misleading:
Returns the result of interpreting leading characters in str as an integer base base (between 2 and 36)
"interpreting" doesn't mean that it tries to parse various number formats (like Date.parse does for date formats). It means that it looks for what's a valid integer literal in Ruby (in the given base). For example:
1234 #=> 1234
'1234'.to_i #=> 1234
1_234 #=> 1234
'1_234'.to_i #=> 1234
0d1234 #=> 1234
'0d1234'.to_i #=> 1234
0x04D2 #=> 1234
'0x04D2'.to_i(16) #=> 1234
Your input as a whole however is not a valid integer literal: (Ruby doesn't like the ,)
980,323,344.00
# SyntaxError (syntax error, unexpected ',', expecting end-of-input)
# 980,323,344.00
# ^
But it starts with a valid integer literal. And that's where the second sentence comes into play:
Extraneous characters past the end of a valid number are ignored.
So the result is 980 – the leading characters which form a valid integer converted to an integer.
If your strings always have that format, you can just delete the offending commas and run the result through to_i which will ignore the trailing .00:
'980,323,344.00'.delete(',') #=> "980323344.00"
'980,323,344.00'.delete(',').to_i #=> 980323344
Otherwise you could use a regular expression to check its format before converting it:
input = '980,323,344.00'
number = case input
         when /\A\d{1,3}(,\d{3})*\.00\z/
           input.delete(',').to_i
         when /other format/
           # other conversion
         end
And if you are dealing with monetary values, you should consider using the money gem and its monetize addition for parsing formatted values:
amount = Monetize.parse('980,323,344.00')
#=> #<Money fractional:98032334400 currency:USD>
amount.format
#=> "$980.323.344,00"
Note that format requires i18n so the above example might require some setup.

Split by multiple delimiters

I'm receiving a string that contains two numbers in a handful of different formats:
"344, 345", "334,433", "345x532" and "432 345"
I need to split them into two separate numbers in an array using split, and then convert them using Integer(num).
What I've tried so far:
nums.split(/[\s+,x]/) # split on one or more spaces, a comma or x
However, it doesn't seem to match multiple spaces when testing. Also, it doesn't allow a space in the comma version shown above ("344, 345").
How can I match multiple delimiters?
You are using a character class in your pattern, and a character class matches only one character: [\s+,x] matches a single whitespace character, a +, a comma, or an x. You probably meant something like (?:\s+|,\s*|x).
However, perhaps, a mere \D+ (1 or more non-digit characters) should suffice:
"345, 456".split(/\D+/).map(&:to_i)
R1 = Regexp.union([", ", ",", "x", " "])
#=> /,\ |,|x|\ /
R2 = /\A\d+#{R1}\d+\z/
#=> /\A\d+(?-mix:,\ |,|x|\ )\d+\z/
def split_it(s)
  return nil unless s =~ R2
  s.split(R1).map(&:to_i)
end
split_it("344, 345") #=> [344, 345]
split_it("334,433") #=> [334, 433]
split_it("345x532") #=> [345, 532]
split_it("432 345") #=> [432, 345]
split_it("432&345") #=> nil
split_it("x32 345") #=> nil
Your original regex would work with a minor adjustment to move the '+' symbol outside the character class:
"344 ,x 345".split(/[\s,x]+/).map(&:to_i) #==> [344,345]
If the examples are actually the only formats that you'll encounter, this will work well. However, if you have to be more flexible and accommodate unknown separators between the numbers, you're better off with the answer given by Wiktor:
"344 ,x 345".split(/\D+/).map(&:to_i) #==> [344,345]
Both cases will return an array of Integers from the inputs given, however the second example is both more robust and easier to understand at a glance.
it doesn't seem to match multiple spaces when testing
Yeah, a character class (square brackets) doesn't work like this. A quantifier applies to the class as a whole, not to the characters inside it. You could use the | operator instead, something like this:
.split(%r[\s+|,\s*|x])
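To check an alternation-based delimiter against the four formats from the question, a short sketch like the following may help. The constant name DELIMITER is hypothetical, and Integer(n, 10) is used as the question asked for Integer rather than to_i.

```ruby
# Hypothetical delimiter: a comma with optional surrounding spaces,
# a run of whitespace, or the letter x.
DELIMITER = /\s*,\s*|\s+|x/

inputs = ["344, 345", "334,433", "345x532", "432   345"]
pairs  = inputs.map { |s| s.split(DELIMITER).map { |n| Integer(n, 10) } }
#=> [[344, 345], [334, 433], [345, 532], [432, 345]]
```

Because Integer is strict, an unexpected separator raises ArgumentError instead of producing a silent 0, which can be useful when the input format is supposed to be fixed.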

Use regular expression to fetch 3 groups from string

I want to input a string and get three strings back.
I have no idea how to do this with a regex in Ruby.
This is my rough idea:
match(/(.*?)(_)(.*?)(\d+)/)
Input and expected output
# "R224_OO2003" => R224, OO, 2003
# "R2241_OOP2003" => R2241, OOP, 2003
If the example description I gave in my comment on the question is correct, you need a very straightforward regex:
r = /(.+)_(.+)(\d{4})/
Then:
"R224_OO2003".scan(r).flatten #=> ["R224", "OO", "2003"]
"R2241_OOP2003".scan(r).flatten #=> ["R2241", "OOP", "2003"]
Assuming that your three parts consist of (R and one or more digits), then an underbar, then (one or more non-whitespace characters), before finally (a 4-digit numeric date), then your regex could be something like this:
^(R\d+)_(\S+)(\d{4})$
The ^ indicates start of string, and the $ indicates end of string. \d+ indicates one or more digits, while \S+ says one or more non-whitespace characters. The \d{4} says exactly four digits.
To recover data from the matches, you could either use the pre-defined globals that line up with your groups, or you could use named captures.
To use the match globals just use $1, $2, and $3. In general, you can figure out the number to use by counting the left parentheses of the specific group.
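A minimal sketch of the numbered globals, using the pattern from this answer (the variable name parts is only for illustration):

```ruby
# $1, $2 and $3 are set by the most recent successful match.
if "R2241_OOP2003" =~ /^(R\d+)_(\S+)(\d{4})$/
  parts = [$1, $2, $3]
end
parts
#=> ["R2241", "OOP", "2003"]
```

Note that these globals are overwritten by the next match, so it is usually safest to copy them into local variables right away.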
To use the named captures, include ?<name> right after the left paren of a particular group. For example:
x = "R2241_OOP2003"
match_data = /^(?<first>R\d+)_(?<second>\S+)(?<third>\d{4})$/.match(x)
puts match_data['first'], match_data['second'], match_data['third']
yields
R2241
OOP
2003
as expected.
As long as your pattern covers all possibilities, then you just need to use the match object to return the 3 strings:
my_match = "R224_OO2003".match(/(.*?)(_)(.*?)(\d+)/)
#=> #<MatchData "R224_OO2003" 1:"R224" 2:"_" 3:"OO" 4:"2003">
puts my_match[0] #=> "R224_OO2003"
puts my_match[1] #=> "R224"
puts my_match[2] #=> "_"
puts my_match[3] #=> "OO"
puts my_match[4] #=> "2003"
A MatchData object behaves like an array of the match groups, starting at index [1]. As you can see, index [0] returns the entire matched text. If you don't want to capture the "_", you can leave its parentheses out.
Also, I'm not sure you are getting what you want with the part:
(.*?)
this basically says zero or more of any single character, matched lazily (as few characters as possible), so it stops as soon as the rest of the pattern can match.
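To see how the lazy quantifier in (.*?) differs from the greedy (.*), here is a small sketch on the sample input (the variable names are only for illustration):

```ruby
sample = "R224_OO2003"

lazy   = sample.match(/(.*?)(\d+)/)  # take as little as possible before the digits
greedy = sample.match(/(.*)(\d+)/)   # take as much as possible before the digits

[lazy[1], lazy[2]]      #=> ["R", "224"]
[greedy[1], greedy[2]]  #=> ["R224_OO200", "3"]

# A lazy group may also match nothing at all:
"2003".match(/(.*?)(\d+)/)[1]  #=> ""
```

In the pattern from the question the literal "_" after the first lazy group is what forces it to extend up to "R224".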

How to validate that a string is a proper hexadecimal value in Ruby?

I am writing a 6502 assembler in Ruby. I am looking for a way to validate hexadecimal operands in string form. I understand that the String object provides a "hex" method to return a number, but here's a problem I run into:
"0A".hex #=> 10 - a valid hexadecimal value
"0Z".hex #=> 0 - invalid, produces a zero
"asfd".hex #=> 10 - Why 10? I guess it reads 'a' first and stops at 's'?
You will get some odd results by typing in a bunch of gibberish. What I need is a way to first verify that the value is a legit hex string.
I was playing around with regular expressions, and realized I can do this:
true if "0A" =~ /[A-Fa-f0-9]/
#=> true
true if "0Z" =~ /[A-Fa-f0-9]/
#=> true <-- PROBLEM
I'm not sure how to address this issue. I need to be able to verify that letters are only A-F and that if it is just numbers that is ok too.
I'm hoping to avoid spaghetti code riddled with "if" statements. I am hoping that someone could provide a "one-liner" or some form of elegant code.
Thanks!
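!str[/\H/] returns true when str contains no non-hex characters (\H matches any character that is not a hex digit). Note that it is also true for an empty string.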
String#hex does not interpret the whole string as hex, it extracts from the beginning of the string up to as far as it can be interpreted as hex. With "0Z", the "0" is valid hex, so it interpreted that part. With "asfd", the "a" is valid hex, so it interpreted that part.
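One method (note that it rejects strings with leading zeros, such as "0A", because to_s(16) does not reproduce them):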
str.to_i(16).to_s(16) == str.downcase
Another:
str =~ /\A[a-f0-9]+\Z/i # or simply /\A\h+\Z/ (see hirolau's answer)
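About your regex: you have to use anchors (\A for beginning of string and \Z for end of string) to say that you want the full string to match. Also, the + repeats the match for one or more characters. Be aware that \Z also matches just before a trailing newline; use \z if the string must end exactly there.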
Note that you could use ^ (begin of line) and $ (end of line), but this would allow strings like "something\n0A" to pass.
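The difference between line anchors and string anchors can be checked with a short sketch (the constant names are hypothetical):

```ruby
LINE_ANCHORED   = /^\h+$/     # ^ and $ match at line boundaries
STRING_ANCHORED = /\A\h+\z/   # \A and \z match only at the string boundaries

sneaky = "something\n0A"

LINE_ANCHORED.match?(sneaky)    #=> true  (the second line is valid hex)
STRING_ANCHORED.match?(sneaky)  #=> false (the whole string is not)
STRING_ANCHORED.match?("0A")    #=> true
/\A\h+\Z/.match?("0A\n")        #=> true  (\Z tolerates one trailing newline)
STRING_ANCHORED.match?("0A\n")  #=> false
```

For validating assembler operands, the \A...\z form is probably the safest default, since it leaves no room for extra lines or a stray newline.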
This is an old question, but I just had the issue myself. I opted for this in my code:
str =~ /^\h+$/
It has the added benefit of returning nil if str is nil.
Since Ruby has literal hex built-in, you can eval the string and rescue the SyntaxError
eval "0xA" #=> 10
eval "0xZ" #=> SyntaxError
You can use this on a method like
def is_hex?(str)
  begin
    eval("0x#{str}")
    true
  rescue SyntaxError
    false
  end
end
is_hex?('0A') #=> true
is_hex?('0Z') #=> false
Of course, since you are using eval, make sure you are sending only safe values to the method.
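One way to get a similar check without eval is Kernel#Integer with an explicit base. This is only a sketch, and the method name hex_string? is hypothetical. Keep in mind that Integer is a little more permissive than a plain digit check: with base 16 it also accepts a sign, an optional 0x prefix, and underscores between digits.

```ruby
# Hypothetical eval-free check: Integer(str, 16) raises ArgumentError
# when str cannot be parsed as a base-16 integer.
def hex_string?(str)
  Integer(str, 16)
  true
rescue ArgumentError, TypeError
  false
end

hex_string?("0A")    #=> true
hex_string?("0Z")    #=> false
hex_string?("asfd")  #=> false
```

If the extra forms (sign, prefix, underscores) must be rejected, an anchored regex such as /\A\h+\z/ is the stricter option.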

Escape problem with hex

I need to print escaped characters to a binary file using Ruby. The main problem is that slashes need the whole byte to escape correctly, and I don't know/can't create the byte in such a way.
I am creating the hex value with, basically:
'\x' + char
Where char is some 'hex' value, such as 65. In hex, \x65 is the ASCII character 'e'.
Unfortunately, when I puts this sequence to the file, I end up with this:
\\x65
How do I create a hex string with the properly escaped value? I have tried a lot of things, involving single or double quotes, pack, unpack, multiple slashes, etc. I have tried so many different combinations that I feel as though I understand the problem less now than I did when I started.
How?
You may need to set binary mode on your file, and/or use putc.
File.open("foo.tmp", "w") do |f|
  f.set_encoding(Encoding::BINARY) # set_encoding is Ruby 1.9
  f.binmode # only useful on Windows
  f.putc "65".hex # writes the single byte 0x65 ("e")
end
Hopefully this can give you some ideas even if you have Ruby <1.9.
Okay, if you want to create a string whose first byte has the integer value 0x65, use Array#pack
irb> [0x65].pack('U')
#=> "e"
irb> "e".ord # "e"[0] returns 101 only in Ruby 1.8; later versions return "e"
#=> 101
101 in decimal is 65 in hexadecimal (0x65), so this works.
If you want to create a literal string whose first byte is '\', second is 'x', third is '6', and fourth is '5', then just use interpolation:
irb> "\\x#{65}"
#=> "\\x65"
irb> "\\x65".split('')
#=> ["\\", "x", "6", "5"]
If you have the hex value and you want to create a string containing the character corresponding to that hex value, you can do:
irb(main):002:0> '65'.hex.chr
=> "e"
Another option is to use Array#pack; this can be used if you need to convert a list of numbers to a single string:
irb(main):003:0> ['65'.hex].pack("C")
=> "e"
irb(main):004:0> ['66', '6f', '6f'].map {|x| x.hex}.pack("C*")
=> "foo"
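Putting the pack approach together with the file output the question asks about, a minimal sketch might look like this. The file name and variable names are hypothetical; File.binwrite and File.binread are used so that no newline or encoding translation gets in the way.

```ruby
require "tmpdir"

# Hex strings as they might come from the question's `char` values.
hex_bytes = %w[66 6f 6f]

# Convert each hex string to an integer, then pack them as unsigned bytes.
data = hex_bytes.map(&:hex).pack("C*")

# Hypothetical scratch file in the system temp directory.
path = File.join(Dir.tmpdir, "hex_escape_demo.bin")
File.binwrite(path, data)

round_trip = File.binread(path)
File.delete(path)

round_trip.bytes  #=> [102, 111, 111]
```

The file contains the three raw bytes 0x66 0x6f 0x6f, with no backslashes or "x" characters written at all.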

Resources