Ruby unfamiliar string usage with Integer.chr and "\001" - ruby

Recently I stumbled over this code snippet in Ruby:
#data = 3.chr * 5
which results in "\003\003\003\003\003"
later in the code for example
flag = #data[2] & 2
is used,
I know that it has something todo with bitwise-flags. It seems the values 1,2 and 3 are used as state flags, but because ruby 1.9, which is the version I am familar with, changed the Integer.chr method the code does no longer work and I would really like to know whats going on.
Furthermore, what is the purpose of the "\00x" escaped-thing?
Thanks for your answers

To make the code work in Ruby 1.9, try changing that line to:
flag = #data[2].ord & 2
Prior to Ruby 1.9, str[n] would return an integer between 0 and 255, but in Ruby 1.9 with its new unicode support, str[n] returns a character (string of length 1). To get the integer instead of character, you can call .ord on the character.
The & operator is just the standard bitwise AND operator common to C, Ruby, and many other languages.
Byte number three (0x03) is not a printable ASCII character, so when you have that byte in a string and call inspect ruby denotes that byte as \003. Just make sure you understand that "\003" is a single-byte string while '\003' is a four-byte string.
In Ruby, strings are really sequences of bytes. In Ruby 1.9, there is also encoding information, but they are still really just a sequence of bytes.

The "\00X" thing is an octal representation of the value.
So if we do:
irb(main):001:0> 15.chr
=> "\017"
irb(main):002:0> 16.chr
=> "\020"
Notice how we went from 17 right to 20? Octal.
"\003\003\003\003\003" is 5 bytes of the value 3 and you can then bitwise and them with other bytes, such as 2 or \002.
So 3 or 0011 in binary anded with 2 (0010) is 2 (0010)
The 1.9 issue occurs on account of 1.9 not using ascii like 1.8 does. David Grayson hits that point well.

Note that ruby 1.9 will inspect unprintable characters in the hexadecimal representation:
3.chr # => "\x03"
Even more confusing is that sometimes the strings will appear in unicode (UTF-8):
"\003" # => "\u0003" (utf-8)
3.chr.encoding # => #<Encoding:US-ASCII>
"\003".encoding # => #<Encoding:UTF-8>
"\003" == 3.chr # => true (this is strange because the encoding is different)
If you're trying to understand how these octal and hex strings relate to decimal numbers, you can convert them to binary:
"\003".unpack('B*') # same as "\003".ord.to_s(2)
# => ["00000011"] # the 2 least significant bits are set
2.to_s(2) # convert to base 2
#=> "10"
The expression 3 & 2 is a bitwise-and of binary numbers 11b and 10b, which will yield 10b (because 1 & 1 is 1 for the most significant bit; 1 & 0 is 0 for least significant).
Other conversions:
'%x' % 97 # => '61' hex
0x61 # => 97 decimal from raw hex input
'%o' % 97 # => '141' octal
0141 # => 97 decimal from raw octal input
This is sort of a crash course but you should probably google for more in-depth info.

Related

Use Ruby to generate hex codepoints for Unicode values

I'm looking to use Ruby to convert codepoints to values that are easy to look up in Unicode references.
I know that I can get the codepoint itself using String#codepoints, ("a".codepoints => [97]) but I am looking to output the following via a few methods, let's call them convert_unicode, convert_unicode_to_hex, and convert_unicode_to_codepoints for the sake of this question:
character = "a"
character.codepoints => [97]
convert_unicode("97") => "U+0061"
convert_unicode_to_hex("U+0061") => 0x61
convert_unicode_to_codepoints("U+0061") => 97
I tried using 97.to_s(16) but then got in a mess when I was adding padding of 0's, because another example unicode I'd like this to work for is U+1F028. How would you approach this problem?
You could use format:
format('U+%04X', 97) #=> "U+0061"
format('U+%04X', 127016) #=> "U+1F028"
U+ interpreted literally
% start of format sequence
0 pad with zeros
4 minimum width of 4 characters
X convert argument to uppercase hex number
Use String#rjust:
[97, 127016].map { |i| "U+" << i.to_s(16).upcase.rjust(4, '0') }
#⇒ ["U+0061", "U+1F028"]
For other operations:
"U+0061"[/(?<=\AU\+).*/].to_i(16)
#⇒ 97
"U+0061"[/(?<=\AU\+).*/].prepend('0x')
#⇒ "0x0061"
NB: 0x61 might live as string only, since 0x61 and 97 are the same value internally, both represented by 97.

Ruby Cyphering Leads to non Alphanumeric Characters [duplicate]

This question already has answers here:
Rotating letters in a string so that each letter is shifted to another letter by n places
(4 answers)
Closed 5 years ago.
I'm trying to make a basic cipher.
def caesar_crypto_encode(text, shift)
(text.nil? or text.strip.empty? ) ? "" : text.gsub(/[a-zA-Z]/){ |cstr|
((cstr.ord)+shift).chr }
end
but when the shift is too high I get these kinds of characters:
Test.assert_equals(caesar_crypto_encode("Hello world!", 127), "eBIIL TLOIA!")
Expected: "eBIIL TLOIA!", instead got: "\xC7\xE4\xEB\xEB\xEE \xF6\xEE\xF1\xEB\xE3!"
What is this format?
The reason you get the verbose output is because Ruby is running with UTF-8 encoding, and your conversion has just produced gibberish characters (an invalid character sequence under UTF-8 encoding).
ASCII characters A-Z are represented by decimal numbers (ordinals) 65-90, and a-z is 97-122. When you add 127 you push all the characters into 8-bit space, which makes them unrecognizable for proper UTF-8 encoding.
That's why Ruby inspect outputs the encoded strings in quoted form, which shows each character as its hexadecimal number "\xC7...".
If you want to get some semblance of characters out of this, you could re-encode the gibberish into ISO8859-1, which supports 8-bit characters.
Here's what you get if you do that:
s = "\xC7\xE4\xEB\xEB\xEE \xF6\xEE\xF1\xEB\xE3!"
>> s.encoding
=> #<Encoding:UTF-8>
# Re-encode as ISO8859-1.
# Your terminal (and Ruby) is using UTF-8, so Ruby will refuse to print these yet.
>> s.force_encoding('iso8859-1')
=> "\xC7\xE4\xEB\xEB\xEE \xF6\xEE\xF1\xEB\xE3!"
# In order to be able to print ISO8859-1 on an UTF-8 terminal, you have to
# convert them back to UTF-8 by re-encoding. This way your terminal (and Ruby)
# can display the ISO8859-1 8-bit characters using UTF-8 encoding:
>> s.encode('UTF-8')
=> "Çäëëî öîñëã!"
# Another way is just to repack the bytes into UTF-8:
>> s.bytes.pack('U*')
=> "Çäëëî öîñëã!"
Of course the proper way to do this, is not to let the numbers overflow into 8-bit space under any circumstance. Your encryption algorithm has a bug, and you need to ensure that the output is in the 7-bit ASCII range.
A better solution
Like #tadman suggested, you could use tr instead:
AZ_SEQUENCE = *'A'..'Z' + *'a'..'z'
"Hello world!".tr(AZ_SEQUENCE.join, AZ_SEQUENCE.rotate(127).join)
=> "eBIIL tLOIA!
I'm still curious about that format though...
Those characters represent the corresponding ASCII encoding after getting the ordinal (ord) of each letter and adding 127 to it (i.e. (cstr.ord)+shift).chr)
Why? Check Integer#chr, from the docs:
Returns a string containing the character represented by the int's
value according to encoding.
So, for example, take your first letter "H":
char_ord = "H".ord
#=> 72
new_char_ord = char_ord + 127
#=> 199
new_char_ord.chr
#=> "\xC7"
So, 199 corresponds to "\xC7". Keep changing all characters in "Hello world" and you will get "\xC7\xE4\xEB\xEB\xEE \xF6\xEE\xF1\xEB\xE3".
To avoid this you need to loop only with ord values that represent a letter (answer in the Possible duplicate link).

How to deal with Unicode strings in Ruby?

I've seen follownig construction in a tutorial of Ruby:
irb(main):001:0> "abc".each_byte { |c| printf "<%c>", c }
<a><b><c>=> "abc"
However, if I put string Здравствуйте! instead of abc, I get
irb(main):003:0> "Здравствуйте!".each_byte { |c| printf "<%c>", c }
<Ð><><Ð><´><Ñ><><Ð><°><Ð><²><Ñ><><Ñ><><Ð><²><Ñ><><Ð><¹><Ñ><><Ð><µ><!>=> "Здравствуйте!"
How to deal with Unicode strings?
irb(main):005:0> RUBY_VERSION
=> "1.9.3"
▶ "Здравствуйте!".each_char { |c| printf "<%c>", c }
# ⇒ <З><д><р><а><в><с><т><в><у><й><т><е><!>=> "Здравствуйте!"
Byte is byte, while char is char, consisting of bytes.
A byte is 8 bits. But unicode characters can take up multiple bytes when stored on your computer. So for example, lets say the integer code for some unicode character is 8,000, which is what is actually stored on your computer. When ruby reads in 8,000, ruby knows that represents some unicode character. However, 8,000 cannot be stored in one byte on your computer(the largest number that can be stored in one byte is 1111 1111, which is 255). If you tell ruby that each byte of the several bytes stored on your computer for 8,000 represents one character, i.e. by calling each_byte(), then ruby will never see the 8,000. Instead, ruby will read in a piece of 8,000 and think that represents one character, then read in another piece of 8,000 and think that represents another character.
each_byte() tells ruby to ignore the clusters of bytes, and just read in one byte at a time and then determine what character is represented by the integer stored in that byte.

How convert string in Ruby

Lets say we have the string '\342\200\231' (same as "\\342\\200\\231"). What is a quick way to convert this string to "\342\200\231" (same as ’ Unicode character)?
Proposal:
s.gsub(/\\(\d{3})/) { $1.oct.chr }
It depends on what assumptions you can make about your input.
What you appear to be asking is how to change a 12-character string into a three-character string.
'\342\200\231'
is 12 characters long.
"\342\200\231"
is three characters long; actually three bytes long, but in Ruby 1.8 it is about the same since strings are sequences of bytes anyway.
Here is an EVIL answer for you (you did say quick), which takes advantage of eval to do your "parsing":
irb(main):017:0> s = '\342\200\231'
=> "\\342\\200\\231"
irb(main):018:0> t = eval('"' + s + '"')
=> "\342\200\231"
irb(main):019:0> s.length
=> 12
irb(main):020:0> t.length
=> 3
Sorry for the eval!
I should probably give a more helpful answer... EDIT: Someone else just did.

Converting a hexadecimal number to binary in ruby

I am trying to convert a hex value to a binary value (each bit in the hex string should have an equivalent four bit binary value). I was advised to use this:
num = "0ff" # (say for eg.)
bin = "%0#{num.size*4}b" % num.hex.to_i
This gives me the correct output 000011111111. I am confused with how this works, especially %0#{num.size*4}b. Could someone help me with this?
You can also do:
num = "0ff"
num.hex.to_s(2).rjust(num.size*4, '0')
You may have already figured out, but, num.size*4 is the number of digits that you want to pad the output up to with 0 because one hexadecimal digit is represented by four (log_2 16 = 4) binary digits.
You'll find the answer in the documentation of Kernel#sprintf (as pointed out by the docs for String#%):
http://www.ruby-doc.org/core/classes/Kernel.html#M001433
This is the most straightforward solution I found to convert from hexadecimal to binary:
['DEADBEEF'].pack('H*').unpack('B*').first # => "11011110101011011011111011101111"
And from binary to hexadecimal:
['11011110101011011011111011101111'].pack('B*').unpack1('H*') # => "deadbeef"
Here you can find more information:
Array#pack: https://ruby-doc.org/core-2.7.1/Array.html#method-i-pack
String#unpack1 (similar to unpack): https://ruby-doc.org/core-2.7.1/String.html#method-i-unpack1
This doesn't answer your original question, but I would assume that a lot of people coming here are, instead of looking to turn hexadecimal to actual "0s and 1s" binary output, to decode hexadecimal to a byte string representation (in the spirit of such utilities as hex2bin). As such, here is a good method for doing exactly that:
def hex_to_bin(hex)
# Prepend a '0' for padding if you don't have an even number of chars
hex = '0' << hex unless (hex.length % 2) == 0
hex.scan(/[A-Fa-f0-9]{2}/).inject('') { |encoded, byte| encoded << [byte].pack('H2') }
end
Getting back to hex again is much easier:
def bin_to_hex(bin)
bin.unpack('H*').first
end
Converting the string of hex digits back to binary is just as easy. Take the hex digits two at a time (since each byte can range from 00 to FF), convert the digits to a character, and join them back together.
def hex_to_bin(s) s.scan(/../).map { |x| x.hex.chr }.join end

Resources