How to deal with Unicode strings in Ruby? - ruby

I've seen the following construction in a Ruby tutorial:
irb(main):001:0> "abc".each_byte { |c| printf "<%c>", c }
<a><b><c>=> "abc"
However, if I put string Здравствуйте! instead of abc, I get
irb(main):003:0> "Здравствуйте!".each_byte { |c| printf "<%c>", c }
<Ð><><Ð><´><Ñ><><Ð><°><Ð><²><Ñ><><Ñ><><Ð><²><Ñ><><Ð><¹><Ñ><><Ð><µ><!>=> "Здравствуйте!"
How to deal with Unicode strings?
irb(main):005:0> RUBY_VERSION
=> "1.9.3"

▶ "Здравствуйте!".each_char { |c| printf "<%c>", c }
# ⇒ <З><д><р><а><в><с><т><в><у><й><т><е><!>=> "Здравствуйте!"
A byte is a byte, while a char is a char, which consists of one or more bytes.

A byte is 8 bits, but Unicode characters can take up multiple bytes when stored on your computer. For example, let's say the integer code (the code point) for some Unicode character is 8,000. When Ruby reads in 8,000, it knows that number represents some Unicode character. However, 8,000 cannot be stored in one byte (the largest number that fits in one byte is 1111 1111, which is 255), so the character is stored as a cluster of several bytes. If you tell Ruby that each of those bytes represents one character, i.e. by calling each_byte(), then Ruby will never see the 8,000. Instead, Ruby will read in one piece of 8,000 and think that represents one character, then read in another piece and think that represents another character.
each_byte() tells Ruby to ignore the clusters of bytes, read in one byte at a time, and hand you the integer stored in that byte; printf "%c" then prints whatever character that integer happens to stand for.
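The difference can be checked directly. The following is a minimal sketch that builds the single character with code point 8,000 from the explanation above (using Array#pack with the 'U' directive, which is just one convenient way to get such a string) and compares what each_char and each_byte hand back.

```ruby
# Build the one-character UTF-8 string for code point 8000 (U+1F40).
ch = [8000].pack('U')

chars = ch.each_char.to_a   # whole characters
bytes = ch.each_byte.to_a   # raw bytes, as integers

puts ch.encoding            # UTF-8
puts chars.length           # 1 character ...
puts bytes.length           # ... stored as 3 bytes
p bytes                     # [225, 189, 128], none of them is 8000
p ch.ord                    # 8000, the code point Ruby sees per character
```

Each of the three bytes is below 256, which is why printing them one at a time with "%c" produces unrelated characters.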

Related

How do I print a hex number representing a IEEE 754 float as a float in ruby

I am using ruby to parse a datastream, some parts of which are IEEE-754 floats, but am not sure how to print these as floats. For example:
f = 0xbe80fd31 # -0.2519317
puts "%f" % f
3196124465.000000
How do I get -0.2519317?
Any time you're converting a binary byte stream to something else, you usually end up using String#unpack (and Array#pack if you're going the other way).
If you have these bytes:
bytes = [0xbe, 0x80, 0xfd, 0x31]
then you could say:
bytes.map(&:chr).join.unpack('g')
# [-0.25193169713020325]
and then unwrap the array. This:
bytes.map(&:chr).join
packs the bytes into the string:
"\xbe\x80\xfd\x31"
which is suitable for #unpack. You could also (thanks Stefan) say:
# Variations on getting the bytes into a string for `#unpack`
bytes.pack('C4').unpack('g').first
[0xbe80fd31].pack('L>').unpack('g').first
# Variations using `#unpack1`
bytes.map(&:chr).join.unpack1('g')
bytes.pack('C4').unpack1('g')
[0xbe80fd31].pack('L>').unpack1('g')
If you already have the string then you can go straight to #unpack or #unpack1.
You'll want to use 'e' instead of 'g' if your bytes are in little-endian order, and 'E' or 'G' if you actually have an eight-byte double rather than a four-byte float.
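Putting the pieces together, here is a small sketch of the conversion from the question. The helper name bits_to_float is made up for illustration, and it assumes the integer holds the bits of a 32-bit IEEE 754 single.

```ruby
# Hypothetical helper: reinterpret a 32-bit integer's bits as an IEEE 754 single.
def bits_to_float(bits)
  # 'N' = 32-bit unsigned, big-endian; 'g' = single-precision float, big-endian
  [bits].pack('N').unpack1('g')
end

value = bits_to_float(0xbe80fd31)
puts value.round(7)               # -0.2519317

# The same four bytes in little-endian order need 'e' instead of 'g'.
little = [0x31, 0xfd, 0x80, 0xbe].pack('C4')
same_value = little.unpack1('e')
puts same_value == value          # true
```

String#unpack1 needs Ruby 2.4 or later; on older versions unpack('g').first does the same job.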

How can I convert a UUID to a string using a custom character set in Ruby?

I want to create a valid IFC GUID (IfcGloballyUniqueId) according to the specification here:
http://www.buildingsmart-tech.org/ifc/IFC2x3/TC1/html/ifcutilityresource/lexical/ifcgloballyuniqueid.htm
It's basically a UUID or GUID (128 bit) mapped to a set of 22 characters to limit storage space in a text file.
I currently have this workaround, but it's merely an approximation:
guid = '';22.times{|i|guid<<'0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz_$'[rand(64)]}
It seems best to use ruby SecureRandom to generate a 128 bit UUID, like in this example (https://ruby-doc.org/stdlib-2.3.0/libdoc/securerandom/rdoc/SecureRandom.html):
SecureRandom.uuid #=> "2d931510-d99f-494a-8c67-87feb05e1594"
This UUID needs to be mapped to a string with a length of 22 characters according to this format:
           1         2         3         4         5         6
 0123456789012345678901234567890123456789012345678901234567890123
"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz_$";
I don't understand this exactly.
Should the 32-character hex number be converted to a 128-character binary number, then divided into 22 sets of 6 bits (except for one that gets the remaining 2 bits?), each of which can be converted to a decimal number from 0 to 63? Which in turn can be replaced by the corresponding character from the conversion table?
I hope someone can verify if I'm on the right track here.
And if I am, is there a computationally faster way in Ruby to convert the 128-bit number to the 22 sets of 0-63 than using all these separate conversions?
Edit: For anyone having the same problem, this is my solution for now:
require 'securerandom'
# possible characters in GUID
guid64 = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz_$'
guid = ""
# SecureRandom.uuid: creates a 128 bit UUID hex string
# tr('-', ''): removes the dashes from the hex string
# pack('H*'): converts the hex string to a binary number (high nibble first) (?) is this correct?
# This reverses the number so we end up with the leftover bit on the end, which helps with chopping the string into pieces.
# It needs to be reversed again to end up with a string in the original order.
# unpack('b*'): converts the binary number to a bit string (128 0's and 1's) and places it into an array
# [0]: gets the first (and only) value from the array
# to_s.scan(/.{1,6}/m): chops the string into pieces 6 characters(bits) with the leftover on the end.
[SecureRandom.uuid.tr('-', '')].pack('H*').unpack('b*')[0].to_s.scan(/.{1,6}/m).each do |num|
# take the number (0 - 63) and find the matching character in guid64, add the found character to the guid string
guid << guid64[num.to_i(2)]
end
guid.reverse
Base64 encoding is pretty close to what you want here, but the mappings are different. No big deal, you can fix that:
require 'securerandom'
require 'base64'
# Define the two mappings here, side-by-side
BASE64 = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/'
IFCB64 = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz_$'
def ifcb64(hex)
# Convert from hex to binary, then from binary to Base64
# Trim off the == padding, then convert mappings with `tr`
Base64.encode64([ hex.tr('-', '') ].pack('H*')).gsub(/\=*\n/, '').tr(BASE64, IFCB64)
end
ifcb64(SecureRandom.uuid)
# => "fa9P7E3qJEc1tPxgUuPZHm"
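For comparison, the bit-splitting idea described in the question can be sketched with plain integer arithmetic: treat the UUID as one 128-bit integer and peel off base-64 digits. This is only an illustration of that idea, with the leftover 2 bits ending up in the first character. The helper name uuid_to_22_chars is made up here, and the result has not been checked against the IFC specification or any IFC tool.

```ruby
require 'securerandom'

IFC_CHARS = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz_$'

# Hypothetical helper: map a UUID string to 22 characters of IFC_CHARS.
def uuid_to_22_chars(uuid)
  n = uuid.delete('-').to_i(16)        # the UUID as a 128-bit integer
  digits = Array.new(22) do
    n, digit = n.divmod(64)            # peel off the lowest 6 bits
    IFC_CHARS[digit]
  end
  digits.reverse.join                  # most significant digit first
end

guid = uuid_to_22_chars(SecureRandom.uuid)
puts guid
```

Because 128 = 2 + 21 * 6, the first character only ever carries 2 bits, so it is always one of 0 to 3. The output will generally differ from the Base64-based version above, since Base64 puts the leftover bits at the end instead.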

Ruby Cyphering Leads to non Alphanumeric Characters [duplicate]

This question already has answers here:
Rotating letters in a string so that each letter is shifted to another letter by n places
(4 answers)
Closed 5 years ago.
I'm trying to make a basic cipher.
def caesar_crypto_encode(text, shift)
(text.nil? or text.strip.empty? ) ? "" : text.gsub(/[a-zA-Z]/){ |cstr|
((cstr.ord)+shift).chr }
end
but when the shift is too high I get these kinds of characters:
Test.assert_equals(caesar_crypto_encode("Hello world!", 127), "eBIIL TLOIA!")
Expected: "eBIIL TLOIA!", instead got: "\xC7\xE4\xEB\xEB\xEE \xF6\xEE\xF1\xEB\xE3!"
What is this format?
You get the verbose output because Ruby is running with UTF-8 encoding, and your conversion has produced gibberish (a byte sequence that is invalid under UTF-8).
ASCII characters A-Z are represented by decimal numbers (ordinals) 65-90, and a-z by 97-122. When you add 127 you push every letter above 127, out of the 7-bit ASCII range and into 8-bit space, and those bytes on their own are not valid UTF-8.
That's why Ruby's inspect outputs the string in escaped form, which shows each such byte as its hexadecimal number "\xC7...".
If you want to get some semblance of characters out of this, you could reinterpret the gibberish as ISO8859-1, which assigns a character to every 8-bit value.
Here's what you get if you do that:
s = "\xC7\xE4\xEB\xEB\xEE \xF6\xEE\xF1\xEB\xE3!"
>> s.encoding
=> #<Encoding:UTF-8>
# Relabel the bytes as ISO8859-1 (force_encoding changes only the encoding tag, not the bytes).
# Your terminal (and Ruby) is using UTF-8, so Ruby still shows them escaped.
>> s.force_encoding('iso8859-1')
=> "\xC7\xE4\xEB\xEB\xEE \xF6\xEE\xF1\xEB\xE3!"
# In order to be able to print ISO8859-1 on an UTF-8 terminal, you have to
# convert them back to UTF-8 by re-encoding. This way your terminal (and Ruby)
# can display the ISO8859-1 8-bit characters using UTF-8 encoding:
>> s.encode('UTF-8')
=> "Çäëëî öîñëã!"
# Another way is just to repack the bytes into UTF-8:
>> s.bytes.pack('U*')
=> "Çäëëî öîñëã!"
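The session above relies on the difference between relabeling bytes and transcoding them. As a rough sketch of that difference, using two of the bytes from the gibberish string:

```ruby
raw = "\xC7\xE4".b                            # two raw bytes, tagged as binary

# force_encoding only changes the label; the bytes stay the same.
as_utf8 = raw.dup.force_encoding('UTF-8')
latin1  = raw.dup.force_encoding('ISO-8859-1')

puts as_utf8.valid_encoding?                  # false: not a valid UTF-8 sequence
puts latin1.valid_encoding?                   # true: every byte is a Latin-1 character
puts latin1.bytes == raw.bytes                # true: nothing was converted

# encode actually converts: each Latin-1 character becomes two UTF-8 bytes.
utf8 = latin1.encode('UTF-8')
p utf8.bytes                                  # [195, 135, 195, 164]
```

After the conversion the string holds the same two characters, but in a form a UTF-8 terminal can display.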
Of course, the proper way to do this is not to let the numbers overflow into 8-bit space under any circumstance. Your encryption algorithm has a bug, and you need to ensure that the output stays in the 7-bit ASCII range.
A better solution
Like @tadman suggested, you could use tr instead:
AZ_SEQUENCE = [*'A'..'Z', *'a'..'z']
"Hello world!".tr(AZ_SEQUENCE.join, AZ_SEQUENCE.rotate(127).join)
=> "eBIIL TLOIA!"
I'm still curious about that format though...
Those escapes are the bytes you get after taking the ordinal (ord) of each letter, adding 127 to it, and converting the result back with chr, i.e. ((cstr.ord)+shift).chr. The results are above 127, so they are no longer ASCII characters.
Why? Check Integer#chr, from the docs:
Returns a string containing the character represented by the int's
value according to encoding.
So, for example, take your first letter "H":
char_ord = "H".ord
#=> 72
new_char_ord = char_ord + 127
#=> 199
new_char_ord.chr
#=> "\xC7"
So, 199 corresponds to "\xC7". Keep changing all characters in "Hello world" and you will get "\xC7\xE4\xEB\xEB\xEE \xF6\xEE\xF1\xEB\xE3".
To avoid this, you need to wrap the shifted value around so that it always lands on an ord value that represents a letter (see the answers in the possible duplicate link).
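One way to do that wrapping, sketched here under the assumption that the cipher's alphabet is A-Z followed by a-z (which is what the expected value in the question's test implies), is to shift an index into that sequence modulo its length instead of shifting the raw ordinal:

```ruby
# Assumed alphabet: 52 letters, upper case first, then lower case.
LETTERS = [*'A'..'Z', *'a'..'z'].freeze

def caesar_crypto_encode(text, shift)
  return "" if text.nil? || text.strip.empty?

  text.gsub(/[a-zA-Z]/) do |ch|
    # Shift the position within LETTERS and wrap around at the end.
    LETTERS[(LETTERS.index(ch) + shift) % LETTERS.size]
  end
end

puts caesar_crypto_encode("Hello world!", 127)   # eBIIL TLOIA!
```

Since Ruby's % always returns a non-negative result for a positive modulus, a negative shift decodes what the same positive shift encoded.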

Ruby unfamiliar string usage with Integer.chr and "\001"

Recently I stumbled over this code snippet in Ruby:
@data = 3.chr * 5
which results in "\003\003\003\003\003".
Later in the code, for example,
flag = @data[2] & 2
is used.
I know that it has something to do with bitwise flags. It seems the values 1, 2 and 3 are used as state flags, but because Ruby 1.9, which is the version I am familiar with, changed the Integer.chr method, the code no longer works and I would really like to know what's going on.
Furthermore, what is the purpose of the "\00x" escaped-thing?
Thanks for your answers
To make the code work in Ruby 1.9, try changing that line to:
flag = @data[2].ord & 2
Prior to Ruby 1.9, str[n] would return an integer between 0 and 255, but in Ruby 1.9 with its new unicode support, str[n] returns a character (string of length 1). To get the integer instead of character, you can call .ord on the character.
The & operator is just the standard bitwise AND operator common to C, Ruby, and many other languages.
Byte number three (0x03) is not a printable ASCII character, so when you have that byte in a string and call inspect ruby denotes that byte as \003. Just make sure you understand that "\003" is a single-byte string while '\003' is a four-byte string.
In Ruby, strings are really sequences of bytes. In Ruby 1.9, there is also encoding information, but they are still really just a sequence of bytes.
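As a quick sketch of the points above on Ruby 1.9 or later (a local variable data stands in for the instance variable from the question, and String#getbyte is shown as an alternative to .ord):

```ruby
data = 3.chr * 5                      # five bytes, each with the value 3

# Indexing gives a one-character string, so convert it to an integer first.
flag_via_ord  = data[2].ord & 2
# String#getbyte returns the integer directly.
flag_via_byte = data.getbyte(2) & 2

puts flag_via_ord                     # 2
puts flag_via_byte                    # 2

puts "\003".bytesize                  # 1 (one byte with the value 3)
puts '\003'.bytesize                  # 4 (backslash, 0, 0, 3)
```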
The "\00X" thing is an octal representation of the value.
So if we do:
irb(main):001:0> 15.chr
=> "\017"
irb(main):002:0> 16.chr
=> "\020"
Notice how we went from 17 right to 20? Octal.
"\003\003\003\003\003" is 5 bytes of the value 3 and you can then bitwise and them with other bytes, such as 2 or \002.
So 3 or 0011 in binary anded with 2 (0010) is 2 (0010)
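The 1.9 issue occurs because 1.9 strings are encoding-aware sequences of characters rather than the plain byte arrays of 1.8, so indexing no longer gives you an integer. David Grayson hits that point well.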
Note that ruby 1.9 will inspect unprintable characters in the hexadecimal representation:
3.chr # => "\x03"
Even more confusing is that sometimes the strings will appear in unicode (UTF-8):
"\003" # => "\u0003" (utf-8)
3.chr.encoding # => #<Encoding:US-ASCII>
"\003".encoding # => #<Encoding:UTF-8>
"\003" == 3.chr # => true (the encodings differ, but both strings hold only 7-bit ASCII, so they still compare equal)
If you're trying to understand how these octal and hex strings relate to decimal numbers, you can convert them to binary:
"\003".unpack('B*') # like "\003".ord.to_s(2), but zero-padded to 8 bits and wrapped in an array
# => ["00000011"] # the 2 least significant bits are set
2.to_s(2) # convert to base 2
#=> "10"
The expression 3 & 2 is a bitwise-and of binary numbers 11b and 10b, which will yield 10b (because 1 & 1 is 1 for the most significant bit; 1 & 0 is 0 for least significant).
Other conversions:
'%x' % 97 # => '61' hex
0x61 # => 97 decimal from raw hex input
'%o' % 97 # => '141' octal
0141 # => 97 decimal from raw octal input
This is sort of a crash course but you should probably google for more in-depth info.

Converting a hexadecimal number to binary in ruby

I am trying to convert a hex value to a binary value (each digit in the hex string should have an equivalent four-bit binary value). I was advised to use this:
num = "0ff" # (say for eg.)
bin = "%0#{num.size*4}b" % num.hex.to_i
This gives me the correct output 000011111111. I am confused with how this works, especially %0#{num.size*4}b. Could someone help me with this?
You can also do:
num = "0ff"
num.hex.to_s(2).rjust(num.size*4, '0')
You may have already figured this out, but num.size*4 is the number of digits you want to pad the output to with 0s, because one hexadecimal digit is represented by four (log_2 16 = 4) binary digits.
You'll find the answer in the documentation of Kernel#sprintf (as pointed out by the docs for String#%):
http://www.ruby-doc.org/core/classes/Kernel.html#M001433
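To make the format string from the question less mysterious, here is a sketch that builds it in separate steps (the variable names width, fmt and padded are only for illustration):

```ruby
num   = "0ff"
width = num.size * 4          # 3 hex digits -> 12 binary digits
fmt   = "%0#{width}b"         # interpolation builds the format string "%012b"

# In "%012b": 'b' prints the number in binary, '12' is the minimum width,
# and the leading '0' pads with zeros instead of spaces.
padded = fmt % num.hex

puts fmt                      # %012b
puts padded                   # 000011111111
```

So the #{...} part is ordinary string interpolation that runs first, and only the resulting "%012b" is interpreted by String#%.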
This is the most straightforward solution I found to convert from hexadecimal to binary:
['DEADBEEF'].pack('H*').unpack('B*').first # => "11011110101011011011111011101111"
And from binary to hexadecimal:
['11011110101011011011111011101111'].pack('B*').unpack1('H*') # => "deadbeef"
Here you can find more information:
Array#pack: https://ruby-doc.org/core-2.7.1/Array.html#method-i-pack
String#unpack1 (similar to unpack): https://ruby-doc.org/core-2.7.1/String.html#method-i-unpack1
This doesn't answer your original question, but I would assume that a lot of people coming here are not looking to turn hexadecimal into actual "0s and 1s" binary output, but rather to decode hexadecimal into a byte string (in the spirit of such utilities as hex2bin). As such, here is a good method for doing exactly that:
def hex_to_bin(hex)
# Prepend a '0' for padding if you don't have an even number of chars
hex = '0' << hex unless (hex.length % 2) == 0
hex.scan(/[A-Fa-f0-9]{2}/).inject('') { |encoded, byte| encoded << [byte].pack('H2') }
end
Getting back to hex again is much easier:
def bin_to_hex(bin)
bin.unpack('H*').first
end
Converting the string of hex digits back to binary is just as easy. Take the hex digits two at a time (since each byte can range from 00 to FF), convert the digits to a character, and join them back together.
def hex_to_bin(s)
  s.scan(/../).map { |x| x.hex.chr }.join
end
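The same round trip can also be sketched with pack and unpack1 alone. The helper names hex_to_bytes and bytes_to_hex are made up for this example, and the input is assumed to contain only hex digits.

```ruby
# Hypothetical helper: decode a string of hex digits into a byte string.
def hex_to_bytes(hex)
  hex = "0#{hex}" if hex.length.odd?   # pad to a whole number of bytes
  [hex].pack('H*')                     # 'H*' = hex string, high nibble first
end

# Hypothetical helper: encode a byte string as lower-case hex digits.
def bytes_to_hex(bytes)
  bytes.unpack1('H*')
end

decoded = hex_to_bytes('DEADBEEF')
p decoded.bytes                        # [222, 173, 190, 239]
puts bytes_to_hex(decoded)             # deadbeef
```

Note that the hex comes back in lower case, so compare case-insensitively if the original input used upper case.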

Resources