Ruby Cyphering Leads to non Alphanumeric Characters [duplicate] - ruby

This question already has answers here:
Rotating letters in a string so that each letter is shifted to another letter by n places
(4 answers)
Closed 5 years ago.
I'm trying to make a basic cipher.
def caesar_crypto_encode(text, shift)
(text.nil? or text.strip.empty? ) ? "" : text.gsub(/[a-zA-Z]/){ |cstr|
((cstr.ord)+shift).chr }
end
but when the shift is too high I get these kinds of characters:
Test.assert_equals(caesar_crypto_encode("Hello world!", 127), "eBIIL TLOIA!")
Expected: "eBIIL TLOIA!", instead got: "\xC7\xE4\xEB\xEB\xEE \xF6\xEE\xF1\xEB\xE3!"
What is this format?

The reason you get the verbose output is because Ruby is running with UTF-8 encoding, and your conversion has just produced gibberish characters (an invalid character sequence under UTF-8 encoding).
ASCII characters A-Z are represented by decimal numbers (ordinals) 65-90, and a-z is 97-122. When you add 127 you push all the characters into 8-bit space, which makes them unrecognizable for proper UTF-8 encoding.
That's why Ruby inspect outputs the encoded strings in quoted form, which shows each character as its hexadecimal number "\xC7...".
If you want to get some semblance of characters out of this, you could re-encode the gibberish into ISO8859-1, which supports 8-bit characters.
Here's what you get if you do that:
s = "\xC7\xE4\xEB\xEB\xEE \xF6\xEE\xF1\xEB\xE3!"
>> s.encoding
=> #<Encoding:UTF-8>
# Re-encode as ISO8859-1.
# Your terminal (and Ruby) is using UTF-8, so Ruby will refuse to print these yet.
>> s.force_encoding('iso8859-1')
=> "\xC7\xE4\xEB\xEB\xEE \xF6\xEE\xF1\xEB\xE3!"
# In order to be able to print ISO8859-1 on an UTF-8 terminal, you have to
# convert them back to UTF-8 by re-encoding. This way your terminal (and Ruby)
# can display the ISO8859-1 8-bit characters using UTF-8 encoding:
>> s.encode('UTF-8')
=> "Çäëëî öîñëã!"
# Another way is just to repack the bytes into UTF-8:
>> s.bytes.pack('U*')
=> "Çäëëî öîñëã!"
Of course the proper way to do this, is not to let the numbers overflow into 8-bit space under any circumstance. Your encryption algorithm has a bug, and you need to ensure that the output is in the 7-bit ASCII range.
A better solution
Like #tadman suggested, you could use tr instead:
AZ_SEQUENCE = *'A'..'Z' + *'a'..'z'
"Hello world!".tr(AZ_SEQUENCE.join, AZ_SEQUENCE.rotate(127).join)
=> "eBIIL tLOIA!

I'm still curious about that format though...
Those characters represent the corresponding ASCII encoding after getting the ordinal (ord) of each letter and adding 127 to it (i.e. (cstr.ord)+shift).chr)
Why? Check Integer#chr, from the docs:
Returns a string containing the character represented by the int's
value according to encoding.
So, for example, take your first letter "H":
char_ord = "H".ord
#=> 72
new_char_ord = char_ord + 127
#=> 199
new_char_ord.chr
#=> "\xC7"
So, 199 corresponds to "\xC7". Keep changing all characters in "Hello world" and you will get "\xC7\xE4\xEB\xEB\xEE \xF6\xEE\xF1\xEB\xE3".
To avoid this you need to loop only with ord values that represent a letter (answer in the Possible duplicate link).

Related

How can I decipher a ciphered text?

I got the following ciphered string:
MDExMDExMTEwMTExMDAwMDAxMTAwMTAxMDExMDExMTAwMDEwMDAwMDAxMTEwMDExMDExMDAxMDEwMTExMDAxMTAx MTAwMDAxMDExMDExMDEwMTEwMDEwMQ==
Now I would like to decipher this string into original one. How can I do this? and I don't know about which algorithm is used to cipher the original string, the ciphered string has length of 121 characters.
Artjom B. already noted in a comment that trailing equal signs may indicate a Base64 encoding (a quick Google search reveals this, too). Fortunately, Ruby has a Base64 library to decode it:
require 'base64'
string = 'MDExMDExMTEwMTExMDAwMDAxMTAwMTAxMDExMDExMTAwMDEwMDAwMDAxMTEwMDExMDExMDAxMDEwMTExMDAxMTAx MTAwMDAxMDExMDExMDEwMTEwMDEwMQ=='
decoded = Base64.decode64(string)
#=> "0110111101110000011001010110111000100000011100110110010101110011011000010110110101100101"
The new string consists of 0's and 1's, apparently another encoding, this time a binary one. It could be ASCII characters. Let's take a look at the first 8 "bits":
decoded[0, 8] #=> "01101111"
Converted to a byte, i.e. an integer via to_i: (2 means binary)
decoded[0, 8].to_i(2) #=> 111
And finally to a character via chr:
decoded[0, 8].to_i(2).chr #=> "o"
Nice, "o" is a valid ASCII character, what about the following characters?
decoded[8, 8].to_i(2).chr #=> "p"
decoded[16, 8].to_i(2).chr #=> "e"
decoded[24, 8].to_i(2).chr #=> "n"
That's "open", an English word. I think we have something here. You can probably work out the rest yourself. And beware of the thieves ;-)
Difficult if you don't have any information about the algorithm, but the string looks like just base64 encoded, if you use a decoder you end up with
0110111101110000011001010110111000100000011100110110010101110011011000010110110101100101
don't know whether it makes sense or not

How to deal with Unicode strings in Ruby?

I've seen follownig construction in a tutorial of Ruby:
irb(main):001:0> "abc".each_byte { |c| printf "<%c>", c }
<a><b><c>=> "abc"
However, if I put string Здравствуйте! instead of abc, I get
irb(main):003:0> "Здравствуйте!".each_byte { |c| printf "<%c>", c }
<Ð><><Ð><´><Ñ><><Ð><°><Ð><²><Ñ><><Ñ><><Ð><²><Ñ><><Ð><¹><Ñ><><Ð><µ><!>=> "Здравствуйте!"
How to deal with Unicode strings?
irb(main):005:0> RUBY_VERSION
=> "1.9.3"
▶ "Здравствуйте!".each_char { |c| printf "<%c>", c }
# ⇒ <З><д><р><а><в><с><т><в><у><й><т><е><!>=> "Здравствуйте!"
Byte is byte, while char is char, consisting of bytes.
A byte is 8 bits. But unicode characters can take up multiple bytes when stored on your computer. So for example, lets say the integer code for some unicode character is 8,000, which is what is actually stored on your computer. When ruby reads in 8,000, ruby knows that represents some unicode character. However, 8,000 cannot be stored in one byte on your computer(the largest number that can be stored in one byte is 1111 1111, which is 255). If you tell ruby that each byte of the several bytes stored on your computer for 8,000 represents one character, i.e. by calling each_byte(), then ruby will never see the 8,000. Instead, ruby will read in a piece of 8,000 and think that represents one character, then read in another piece of 8,000 and think that represents another character.
each_byte() tells ruby to ignore the clusters of bytes, and just read in one byte at a time and then determine what character is represented by the integer stored in that byte.

Convert escaped unicode (\u008E) to accented character (Ž) in Ruby?

I am having a very difficult time with this:
# contained within:
"MA\u008EEIKIAI"
# should be
"MAŽEIKIAI"
# nature of string
$ p string3
"MA\u008EEIKIAI"
$ puts string3
MAEIKIAI
$ string3.inspect
"\"MA\\u008EEIKIAI\""
$ string3.bytes
#<Enumerator: "MA\u008EEIKIAI":bytes>
Any ideas on where to start?
Note: this is not a duplicate of my previous question.
\u008E means that the unicode character with the codepoint 8e (in hex) appears at that point in the string. This character is the control character “SINGLE SHIFT TWO” (see the code chart (pdf)). The character Ž is at the codepoint u017d. However it is at position 8e in the Windows CP-1252 encoding. Somehow you’ve got your encodings mixed up.
The easiest way to “fix” this is probably just to open the file containing the string (or the database record or whatever) and edit it to be correct. The real solution will depend on where the string in question came from and how many bad strings you have.
Assuming the string is in UTF-8 encoding, \u008E will consist of the two bytes c2 and 8e. Note that the second byte, 8e, is the same as the encoding of Ž in CP-1252. On way to convert the string would be something like this:
string3.force_encoding('BINARY') # treat the string just as bytes for now
string3.gsub!(/\xC2/n, '') # remove the C2 byte
string3.force_encoding('CP1252') # give the string the correct encoding
string3.encode('UTF-8') # convert to the desired encoding
Note that this isn’t a general solution to fix all issues like this. Not all CP-1252 characters, when mangled and expressed in UTF-8 this way will amenable to conversion like this. Some will be two bytes c2 xx where xx the correct byte (like in this case), others will be c3 yy where yy is a different byte.
What about using Regexp & String#pack to convert the Unicode escape?
str = "MA\\u008EEIKIAI"
puts str #=> MA\u008EEIKIAI
str.gsub!(/\\u(.{4})/) do |match|
[$1.to_i(16)].pack('U')
end
puts str #=> MA EIKIAI

Ruby unfamiliar string usage with Integer.chr and "\001"

Recently I stumbled over this code snippet in Ruby:
#data = 3.chr * 5
which results in "\003\003\003\003\003"
later in the code for example
flag = #data[2] & 2
is used,
I know that it has something todo with bitwise-flags. It seems the values 1,2 and 3 are used as state flags, but because ruby 1.9, which is the version I am familar with, changed the Integer.chr method the code does no longer work and I would really like to know whats going on.
Furthermore, what is the purpose of the "\00x" escaped-thing?
Thanks for your answers
To make the code work in Ruby 1.9, try changing that line to:
flag = #data[2].ord & 2
Prior to Ruby 1.9, str[n] would return an integer between 0 and 255, but in Ruby 1.9 with its new unicode support, str[n] returns a character (string of length 1). To get the integer instead of character, you can call .ord on the character.
The & operator is just the standard bitwise AND operator common to C, Ruby, and many other languages.
Byte number three (0x03) is not a printable ASCII character, so when you have that byte in a string and call inspect ruby denotes that byte as \003. Just make sure you understand that "\003" is a single-byte string while '\003' is a four-byte string.
In Ruby, strings are really sequences of bytes. In Ruby 1.9, there is also encoding information, but they are still really just a sequence of bytes.
The "\00X" thing is an octal representation of the value.
So if we do:
irb(main):001:0> 15.chr
=> "\017"
irb(main):002:0> 16.chr
=> "\020"
Notice how we went from 17 right to 20? Octal.
"\003\003\003\003\003" is 5 bytes of the value 3 and you can then bitwise and them with other bytes, such as 2 or \002.
So 3 or 0011 in binary anded with 2 (0010) is 2 (0010)
The 1.9 issue occurs on account of 1.9 not using ascii like 1.8 does. David Grayson hits that point well.
Note that ruby 1.9 will inspect unprintable characters in the hexadecimal representation:
3.chr # => "\x03"
Even more confusing is that sometimes the strings will appear in unicode (UTF-8):
"\003" # => "\u0003" (utf-8)
3.chr.encoding # => #<Encoding:US-ASCII>
"\003".encoding # => #<Encoding:UTF-8>
"\003" == 3.chr # => true (this is strange because the encoding is different)
If you're trying to understand how these octal and hex strings relate to decimal numbers, you can convert them to binary:
"\003".unpack('B*') # same as "\003".ord.to_s(2)
# => ["00000011"] # the 2 least significant bits are set
2.to_s(2) # convert to base 2
#=> "10"
The expression 3 & 2 is a bitwise-and of binary numbers 11b and 10b, which will yield 10b (because 1 & 1 is 1 for the most significant bit; 1 & 0 is 0 for least significant).
Other conversions:
'%x' % 97 # => '61' hex
0x61 # => 97 decimal from raw hex input
'%o' % 97 # => '141' octal
0141 # => 97 decimal from raw octal input
This is sort of a crash course but you should probably google for more in-depth info.

Ruby: Fuzzing through all unicode characters ‎(UTF8/Encoding/String Manipulation)

I can't iterate over the entire range of unicode characters.
I searched everywhere...
I am building a fuzzer and want to embed into a url, all unicode characters (one at a time).
For example:
http://www.example.com?a=\uff1c
I know that there are some built tools but I need more flexibility.
If i could do someting like the following: "\u" + "ff1c" it would be great.
This is the closest I got:
char = "\u0000"
...
#within iteration
char.succ!
...
but after the character "\u0039", which is the number 9, I will get "10" instead of ":"
You could use pack to convert numbers to UTF8 characters but I'm not sure if this solves your problem.
You can either create an array with numeric values of all the characters and use pack to get an UTF8 string or you can just loop from 0 to whatever you need and use pack within the loop.
I've written a small example to explain myself. The code below prints out the hex value of each character followed by the character itself.
0.upto(100) do |i|
puts "%04x" % i + ": " + [i].pack("U*")
end
Here's some simpler code, albeit slightly obfuscated, that takes advantage of the fact that Ruby will convert an integer on the right hand side of the << operator to a codepoint. This only works with Ruby 1.8 up for integer values <= 255. It will work for values greater than 255 in 1.9.
0.upto(100) do |i|
puts "" << i
end

Resources