UTF-8 Encoding in Ruby using a variable - ruby

I am using Ruby 1.8.7 (and upgrading isn't an option). I would like to create a string of all UTF-8 code points from 0 to 127, written as "\uXXXX".
My problem is that this is being interpreted as (for example) 'u0008'. If I try to use '\u0008' instead, the string becomes "\\u0008", which IS NOT what I want.
I have tried many different ways, but it seems impossible to create a string that is exactly "\uXXXX", i.e. "\u000B". It always ends up as either "\\u000B" or "u000B".
Escaping the '\' isn't an option. I need to send a string to a server such that the server receives '\u000B', for example. This is so that the other server can test its parsing of the \uXXXX syntax. This seems impossible to do in Ruby, however.
Happy if someone can prove me wrong :)

Use Integer#chr to get the character. Here's a clean version:
value = ''  # accumulator string; must exist before appending with <<
(1..127).each do |i|
  value << "U+#{i} = #{i.chr}, hex = \\x#{"%02x" % i}; "
end
The expression "%02x" % i is equivalent to sprintf("%02x", i). It returns the integer as a 2-digit hexadecimal number.
Escaped output (see comments):
value = ''  # accumulator string; must exist before appending with <<
(1..127).each do |i|
  value << "U+#{i} = \\u#{"%04x" % i}, hex = \\x#{"%02x" % i}; "
end
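As a self-contained variant of the escaped-output loop above, here is a minimal sketch that collects the literal escape text for each code point in an array. The name `escapes` is made up for illustration. It relies on single-quoted Ruby strings leaving `\u` as a plain backslash followed by `u`, and it starts at 0 rather than 1 to match the range in the question.

```ruby
# Hypothetical example data: the literal text \uXXXX for code points 0..127.
# In a single-quoted string, '\u' is just a backslash followed by 'u',
# so Ruby does not interpret it as a Unicode escape.
escapes = (0..127).map { |i| format('\u%04X', i) }

puts escapes[11]        # prints the six characters \u000B
puts escapes[11].length # prints 6
```

Each element is six characters long (backslash, `u`, four hex digits), so it can be sent to the server as is.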

Related

Ruby character .ord and .chr [duplicate]

I've been working with the Ruby chr and ord methods recently and there are a few things I don't understand.
My current project involves converting individual characters to and from ordinal values. As I understand it, if I have a string with an individual character like "A" and I call ord on it I get its position on the ASCII table which is 65. Calling the inverse, 65.chr gives me the character value "A", so this tells me that Ruby has a collection somewhere of ordered character values, and it can use this collection to give me the position of a specific character, or the character at a specific position. I may be wrong on this, please correct me if I am.
Now I also understand that Ruby's default character encoding uses UTF-8 so it can work with thousands of possible characters. Thus if I ask it for something like this:
'好'.ord
I get the position of that character which is 22909. However, if I call chr on that value:
22909.chr
I get "RangeError: 22909 out of char range." I'm only able to get chr to work on values up to 255, which is extended ASCII. So my questions are:
Why does Ruby seem to be getting values for chr from the extended ASCII character set but ord from UTF-8?
Is there any way to tell Ruby to use different encodings when it uses these methods? For instance, tell it to use ASCII-8BIT encoding instead of whatever it's defaulting to?
If it is possible to change the default encoding, is there any way of getting the total number of characters available in the set being used?
According to the documentation for Integer#chr, you can pass an encoding argument to force UTF-8:
22909.chr(Encoding::UTF_8)
#=> "好"
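The round trip between ord and chr can be sketched as below. This is an illustration rather than part of the original answer: the variable names are made up, and the character is written as the escape "\u597D" (decimal 22909) so the snippet does not depend on the source file's encoding. As far as I know, chr without an argument only handles single-byte values unless a default internal encoding is set, which would explain the RangeError in the question.

```ruby
# "\u597D" is the character from the question (decimal code point 22909).
char = "\u597D"

code_point = char.ord                         # integer code point
round_trip = code_point.chr(Encoding::UTF_8)  # back to a UTF-8 string
packed     = [code_point].pack('U')           # alternative using Array#pack

puts code_point
puts round_trip == char
puts packed == char
puts 65.chr(Encoding::UTF_8)  # ASCII values work the same way
```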
To list all available encoding names
Encoding.name_list
#=> ["ASCII-8BIT", "UTF-8", "US-ASCII", "UTF-16BE", "UTF-16LE", "UTF-32BE", "UTF-32LE", "UTF-16", "UTF-32", ...]
A hacky way to get the maximum number of characters:
2000000.times.reduce(0) do |x, i|
  begin
    i.chr(Encoding::UTF_8)
    x += 1
  rescue
  end
  x
end
#=> 1112064
After tooling around with this for a while, I realized that I could get the max number of characters for each encoding by running a binary search to find the highest value that doesn't throw a RangeError.
def get_highest_value(set)
  max = 10000000000
  min = 0
  guess = 5000000000
  while true
    begin
      guess.chr(set)
      if min > max
        return max
      else
        min = guess + 1
        guess = (max + min) / 2
      end
    rescue
      if min > max
        return max
      else
        max = guess - 1
        guess = (max + min) / 2
      end
    end
  end
end
The value input to the method is the name of the encoding being checked.

Converting a string to a number (exactly as it is represented in it)?

I have the following:
{:department=>{"Pet Supplies"=>{"Birds"=>"16,414", "Cats"=>"243,384",
"Dogs"=>"512,186", "Fish & Aquatic Pets"=>"47,018",
"Horses"=>"14,749", "Insects"=>"359",
"Reptiles & Amphibians"=>"5,794", "Small Animals"=>"19,797"}}}
Now if I use to_i I get say 16. If I do to_f I get something like 16.0 (and as you can see Ruby is considering the , as a . for some reason).
I want the number to be exactly as in the string but as a number instead: "Birds"=>16,414
How to accomplish that?
Just a notice:
If I do to_f I get something like 16.0 (and as you can see Ruby is considering the , as a . for some reason)
Ruby is not treating the , as a . at all. If it did, the resulting float would be 16.414 and not 16.0. Ruby simply notices an extraneous character and ignores the rest of the string (,414).
How to accomplish that?
Well, if you want 16,414 to be transformed to 16414, the easiest way is to remove the comma:
str = '16,414'
str.delete(',').to_i
# => 16414
In some cultures the , is used as the decimal separator. In that case, if you want to get 16.414, you can replace the , with a . and convert to Float:
str = '16,414'
str.gsub(/,/, '.').to_f
# => 16.414
Try something like below:
"16,414".gsub(",","_").to_i
# => 16414
or (as @Chris Heald suggested)
"19,797".delete(",").to_i
# => 19797
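To apply the same conversion to every value in a hash like the one in the question, something along these lines should work. It is a minimal sketch: the hash is shortened and the names `departments` and `counts` are chosen for illustration.

```ruby
# Shortened, hypothetical version of the inner hash from the question.
departments = {
  "Birds"   => "16,414",
  "Cats"    => "243,384",
  "Insects" => "359"
}

# Build a new hash whose values are integers with the thousands
# separators removed.
counts = departments.each_with_object({}) do |(name, text), result|
  result[name] = text.delete(',').to_i
end

p counts
```

Note that an Integer has no notion of a thousands separator, so 16,414 can only exist as text; the comma has to be added back when the number is formatted for display.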
as you can see Ruby is considering the , as a . for some reason
Yes, it's all quite confusing:
class String
to_i(base=10) → integer
Returns the result of interpreting leading characters in str as an
integer base base (between 2 and 36). Extraneous characters past the
end of a valid number are ignored.
to_f → float
Returns the result of interpreting leading characters in str as a
floating point number. Extraneous characters past the end of a valid
number are ignored.
The ruby docs are public. They are not secret. In fact, you probably have the docs on your computer. Try this:
$ ri String#to_i

Converting a hexadecimal number to binary in ruby

I am trying to convert a hex value to a binary value (each digit in the hex string should have an equivalent four-bit binary value). I was advised to use this:
num = "0ff" # (say for eg.)
bin = "%0#{num.size*4}b" % num.hex.to_i
This gives me the correct output 000011111111. I am confused with how this works, especially %0#{num.size*4}b. Could someone help me with this?
You can also do:
num = "0ff"
num.hex.to_s(2).rjust(num.size*4, '0')
You may have already figured this out, but num.size*4 is the number of digits the output is padded to with 0, because one hexadecimal digit is represented by four binary digits (log_2 16 = 4).
You'll find the answer in the documentation of Kernel#sprintf (as pointed out by the docs for String#%):
http://www.ruby-doc.org/core/classes/Kernel.html#M001433
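To make the format string less mysterious, it can be built in separate steps. The intermediate names `width`, `fmt`, and `binary` are only for illustration; the logic is the same as the one-liner in the question.

```ruby
num = "0ff"

width  = num.size * 4   # 3 hex digits * 4 bits each = 12
fmt    = "%0#{width}b"  # string interpolation produces "%012b"
binary = fmt % num.hex  # same as sprintf("%012b", 255)

puts fmt
puts binary
```

In the resulting format string, `0` asks for zero padding, `12` is the minimum field width, and `b` formats the integer in base 2.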
This is the most straightforward solution I found to convert from hexadecimal to binary:
['DEADBEEF'].pack('H*').unpack('B*').first # => "11011110101011011011111011101111"
And from binary to hexadecimal:
['11011110101011011011111011101111'].pack('B*').unpack1('H*') # => "deadbeef"
Here you can find more information:
Array#pack: https://ruby-doc.org/core-2.7.1/Array.html#method-i-pack
String#unpack1 (similar to unpack): https://ruby-doc.org/core-2.7.1/String.html#method-i-unpack1
This doesn't answer your original question, but I assume many people arriving here don't want literal "0s and 1s" output; they want to decode hexadecimal into a byte string (in the spirit of utilities such as hex2bin). Here is a method for doing exactly that:
def hex_to_bin(hex)
  # Prepend a '0' for padding if you don't have an even number of chars
  hex = '0' << hex unless (hex.length % 2) == 0
  hex.scan(/[A-Fa-f0-9]{2}/).inject('') { |encoded, byte| encoded << [byte].pack('H2') }
end
Getting back to hex again is much easier:
def bin_to_hex(bin)
  bin.unpack('H*').first
end
Converting the string of hex digits back to binary is just as easy. Take the hex digits two at a time (since each byte can range from 00 to FF), convert the digits to a character, and join them back together.
def hex_to_bin(s)
  s.scan(/../).map { |x| x.hex.chr }.join
end
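A quick way to check that these approaches agree is a round trip through a byte string. This is only a sketch with an arbitrary sample value; `hex`, `bytes`, and `scanned` are illustrative names.

```ruby
hex = 'deadbeef'

# Decode pairs of hex digits into raw bytes (4 bytes for 8 hex digits).
bytes = [hex].pack('H*')

# The scan-based approach should produce the same bytes.
scanned = hex.scan(/../).map { |pair| pair.hex.chr }.join

puts bytes.bytesize
puts bytes.unpack('H*').first  # back to the hex digits
puts bytes.unpack('B*').first  # the same bytes as a string of 0s and 1s
puts scanned.bytes == bytes.bytes
```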

Ruby: Fuzzing through all Unicode characters (UTF8/Encoding/String Manipulation)

I can't iterate over the entire range of unicode characters.
I searched everywhere...
I am building a fuzzer and want to embed into a url, all unicode characters (one at a time).
For example:
http://www.example.com?a=\uff1c
I know that there are some existing tools, but I need more flexibility.
If I could do something like the following: "\u" + "ff1c" it would be great.
This is the closest I got:
char = "\u0000"
...
#within iteration
char.succ!
...
but after the character "\u0039", which is the number 9, I will get "10" instead of ":"
You could use pack to convert numbers to UTF8 characters but I'm not sure if this solves your problem.
You can either create an array with the numeric values of all the characters and use pack to get a UTF-8 string, or you can just loop from 0 to whatever you need and use pack within the loop.
I've written a small example to explain myself. The code below prints out the hex value of each character followed by the character itself.
0.upto(100) do |i|
  puts "%04x" % i + ": " + [i].pack("U*")
end
Here's some simpler, albeit slightly obfuscated, code that takes advantage of the fact that Ruby converts an integer on the right-hand side of the << operator to a code point. In Ruby 1.8 this only works for integer values up to 255; in 1.9 it also works for larger values.
0.upto(100) do |i|
  puts "" << i
end
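Putting the pack approach together with the fuzzing goal, a loop over code points might look like the sketch below. The helper names `each_code_point_char` and `fuzz_url` are hypothetical, the URL is the one from the question, and percent-encoding the character with URI.encode_www_form_component is an assumption about what the target expects. Surrogate code points (U+D800 to U+DFFF) are skipped because they are not valid characters on their own.

```ruby
require 'uri'

# Surrogate code points cannot be encoded as standalone UTF-8 characters.
SURROGATES = (0xD800..0xDFFF)

# Hypothetical helper: yields each code point in the range as a
# one-character UTF-8 string, skipping surrogates.
def each_code_point_char(range = 0..0x10FFFF)
  return enum_for(:each_code_point_char, range) unless block_given?
  range.each do |cp|
    next if SURROGATES.cover?(cp)
    yield [cp].pack('U')
  end
end

# Hypothetical helper: embeds one character, percent-encoded, in the URL.
def fuzz_url(char)
  "http://www.example.com?a=" + URI.encode_www_form_component(char)
end

each_code_point_char(0xFF1A..0xFF1C) { |char| puts fuzz_url(char) }
```

Because the loop works on integers, it moves from "9" (0x39) to ":" (0x3A) without the String#succ rollover described in the question.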

How can I output leading zeros in Ruby?

I'm outputting a set of numbered files from a Ruby script. The numbers come from incrementing a counter, but to make them sort nicely in the directory, I'd like to use leading zeros in the filenames. In other words
file_001...
instead of
file_1
Is there a simple way to add leading zeros when converting a number to a string? (I know I can do "if less than 10.... if less than 100").
Use the % operator with a string:
irb(main):001:0> "%03d" % 5
=> "005"
The left-hand-side is a printf format string, and the right-hand side can be a list of values, so you could do something like:
irb(main):002:0> filename = "%s/%s.%04d.txt" % ["dirname", "filename", 23]
=> "dirname/filename.0023.txt"
Here's a printf format cheat sheet you might find useful in forming your format string. The printf format is originally from the C function printf, but similar formatting functions are available in Perl, Ruby, Python, Java, PHP, etc.
If the maximum number of digits in the counter is known (e.g., n = 3 for counters 1..876), you can do
str = "file_" + i.to_s.rjust(n, "0")
Can't you just use string formatting of the value before you concat the filename?
"%03d" % number
Use String#next as the counter.
>> n = "000"
>> 3.times { puts "file_#{n.next!}" }
file_001
file_002
file_003
next is relatively 'clever', meaning you can even go for
>> n = "file_000"
>> 3.times { puts n.next! }
file_001
file_002
file_003
As stated by the other answers, "%03d" % number works pretty well, but it goes against the rubocop ruby style guide:
Favor the use of sprintf and its alias format over the fairly
cryptic String#% method
We can obtain the same result in a more readable way using the following:
format('%03d', number)
filenames = '000'.upto('100').map { |index| "file_#{index}" }
Outputs
["file_000", "file_001", "file_002", "file_003", ..., "file_098", "file_099", "file_100"]
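For comparison, here are the format and rjust approaches from the answers above side by side. This is a small sketch with made-up variable names.

```ruby
# Zero-pad a counter to three digits with a printf-style format string.
names = (1..3).map { |i| format('file_%03d', i) }

# The same padding using String#rjust on the converted number.
padded = "file_" + 7.to_s.rjust(3, "0")

p names
puts padded
```

Both approaches only pad: a counter with more digits than the width, such as 1000, is printed in full rather than truncated.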
