Cannot convert ISO8859-1 to Cyrillic in Ruby

I have text "ÐоÑÑинаÑ", and I want to convert it to cyrillic. 2cyr.com says that this is ISO8859-1 format. I tried
"ÐоÑÑинаÑ".force_encoding("ISO8859-1").encode("UTF-8")
But it returned =>
"Ã\u0090Â\u0093Ã\u0090¾Ã\u0091Â\u0081Ã\u0091Â\u0082Ã\u0090¸Ã\u0090½Ã\u0090°Ã\u0091Â\u008F"
What should I do to make the final word be "Гостиная"?

It's the other way round. Your string is the result of:
str = "Гостиная".force_encoding('ISO8859-1').encode('UTF-8')
#=> "Ð\u0093оÑ\u0081Ñ\u0082инаÑ\u008F"
puts str
#=> ÐоÑÑинаÑ
To revert it, use:
str.encode('ISO8859-1').force_encoding('UTF-8')
#=> "Гостиная"
Of course, this only works if the malformed string is left intact (it contains several invisible / unprintable characters).
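As a rough sketch of that round trip, the revert can be wrapped in a small helper. The method name repair_mojibake is made up for illustration, and the sketch assumes the damage really was UTF-8 bytes read as ISO-8859-1; it returns nil when that assumption does not hold.

```ruby
# Hypothetical helper (not from the answer above): undo text that was
# UTF-8 but got decoded as ISO-8859-1 and then re-encoded as UTF-8.
def repair_mojibake(str)
  # Map each character back to a single byte, then reinterpret the bytes as UTF-8.
  repaired = str.encode("ISO-8859-1").force_encoding("UTF-8")
  repaired.valid_encoding? ? repaired : nil
rescue Encoding::UndefinedConversionError
  # The input contains characters outside ISO-8859-1, so it is not this kind of mojibake.
  nil
end

# Reproduce the breakage from an intact string, then repair it.
original = "Гостиная"
broken   = original.dup.force_encoding("ISO-8859-1").encode("UTF-8")
puts repair_mojibake(broken) # prints Гостиная
```

Building the broken string in code keeps the invisible characters intact, which is exactly what a copy-pasted version tends to lose.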

Best you can do is switch the order of methods:
puts "ÐоÑÑинаÑ".encode("CP1252")
#=> �о��ина�
Your string still contains broken chars, but that is likely to be inherent to your original string. Online tools like this one give the same result.

Related

How to print an escape character in Ruby?

I have a string containing an escape character:
word = "x\nz"
and I would like to print it as x\nz.
However, puts word gives me:
x
z
How do I get puts word to output x\nz instead of creating a new line?
Use String#inspect
puts word.inspect #=> "x\nz"
Or just p
p word #=> "x\nz"
I have a string containing an escape character:
No, you don't. You have a string containing a newline.
How do I get puts word to output x\nz instead of creating a new line?
The easiest way would be to just create the string in the format you want in the first place:
word = 'x\nz'
# or
word = "x\\nz"
If that isn't possible, you can translate the string the way you want:
word = word.gsub("\n", '\n')
# or
word.gsub!("\n", '\n')
You may be tempted to do something like
puts word.inspect
# or
p word
Don't do that! #inspect is not guaranteed to have any particular format. Its only requirement is that it return a human-readable string representation suitable for debugging. You should never rely on the exact content of #inspect; the only thing you can rely on is that it is human-readable.
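If more than one control character needs translating, the gsub approach can be extended with a lookup table. This is only a sketch: the constant ESCAPES and the method escape_control_chars are names invented here, and the set of characters covered is an assumption about what you want to display.

```ruby
# Hypothetical lookup table: control character => visible escape sequence.
ESCAPES = { "\n" => '\n', "\t" => '\t', "\r" => '\r' }.freeze

# Replace each listed control character with its two-character escape.
# When gsub is given a hash, the matched text is used as the lookup key.
def escape_control_chars(str)
  str.gsub(/[\n\t\r]/, ESCAPES)
end

puts escape_control_chars("x\nz") # prints x\nz
```

Unlike #inspect, the output format here is entirely under your control.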

Convert string with hex ASCII codes to characters

I have a string containing hex code values of ASCII characters, e.g. "666f6f626172". I want to convert it to the corresponding string ("foobar").
This is working but ugly:
"666f6f626172".scan(/../).map(&:hex).map(&:chr).join # => "foobar"
Is there a better (more concise) way? Could unpack be helpful somehow?
You can use Array#pack:
["666f6f626172"].pack('H*')
#=> "foobar"
H is the directive for a hex string (high nibble first).
Stefan has nailed it, but here's an alternative you may want to tuck away for another time and place:
"666f6f626172".gsub(/../) { |pair| pair.hex.chr } # => "foobar"
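For completeness, here is a small sketch of the pack approach together with its inverse. The method names are made up, and the up-front check for an even number of hex digits is an added assumption about what counts as valid input.

```ruby
# Hypothetical helpers wrapping Array#pack / String#unpack1 with the H directive.
def hex_to_bytes(hex)
  # Require complete pairs of hex digits before decoding.
  raise ArgumentError, "expected pairs of hex digits" unless hex.match?(/\A(?:\h\h)+\z/)
  [hex].pack("H*")
end

def bytes_to_hex(str)
  str.unpack1("H*")
end

puts hex_to_bytes("666f6f626172") # prints foobar
puts bytes_to_hex("foobar")       # prints 666f6f626172
```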

How to validate that a string is a proper hexadecimal value in Ruby?

I am writing a 6502 assembler in Ruby. I am looking for a way to validate hexadecimal operands in string form. I understand that the String object provides a "hex" method to return a number, but here's a problem I run into:
"0A".hex #=> 10 - a valid hexadecimal value
"0Z".hex #=> 0 - invalid, produces a zero
"asfd".hex #=> 10 - Why 10? I guess it reads 'a' first and stops at 's'?
You will get some odd results by typing in a bunch of gibberish. What I need is a way to first verify that the value is a legit hex string.
I was playing around with regular expressions, and realized I can do this:
true if "0A" =~ /[A-Fa-f0-9]/
#=> true
true if "0Z" =~ /[A-Fa-f0-9]/
#=> true <-- PROBLEM
I'm not sure how to address this issue. I need to be able to verify that letters are only A-F and that if it is just numbers that is ok too.
I'm hoping to avoid spaghetti code riddled with "if" statements. I am hoping that someone could provide a "one-liner" or some form of elegant code.
Thanks!
!str[/\H/] will look for invalid hex values.
String#hex does not interpret the whole string as hex, it extracts from the beginning of the string up to as far as it can be interpreted as hex. With "0Z", the "0" is valid hex, so it interpreted that part. With "asfd", the "a" is valid hex, so it interpreted that part.
One method (note that it rejects strings with leading zeros, such as "0A", because to_s(16) drops them):
str.to_i(16).to_s(16) == str.downcase
Another:
str =~ /\A[a-f0-9]+\z/i # or simply /\A\h+\z/ (see hirolau's answer)
About your regex: you have to use anchors (\A for the beginning of the string and \z for the end of the string) to say that you want the full string to match. Also, the + repeats the match for one or more characters. (\Z would also match just before a trailing newline, so "0A\n" would pass.)
Note that you could use ^ (begin of line) and $ (end of line), but this would allow strings like "something\n0A" to pass.
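Put together, the anchored check might look like the sketch below. The method name hex_string? is invented here, and treating nil or an empty string as invalid is an assumption about what the assembler needs.

```ruby
# Hypothetical predicate: true only if the whole string consists of hex digits.
def hex_string?(str)
  # \A and \z anchor to the very start and very end of the string,
  # so embedded or trailing newlines cannot sneak through.
  str.is_a?(String) && str.match?(/\A\h+\z/)
end

p hex_string?("0A")            # prints true
p hex_string?("0Z")            # prints false
p hex_string?("something\n0A") # prints false
```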
This is an old question, but I just had the issue myself. I opted for this in my code:
str =~ /^\h+$/
It has the added benefit of returning nil if str is nil.
Since Ruby has hex literals built in, you can eval the string and rescue the SyntaxError:
eval "0xA" #=> 10
eval "0xZ" #=> SyntaxError
You can use this in a method like:
def is_hex?(str)
  begin
    eval("0x#{str}")
    true
  rescue SyntaxError
    false
  end
end
is_hex?('0A') #=> true
is_hex?('0Z') #=> false
Of course, since you are using eval, make sure you only pass safe values to the method.
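If eval is a concern, one possible alternative is Kernel#Integer with an explicit base, which parses without executing anything. This is a sketch, not a drop-in equivalent: the method name parse_hex is made up, the exception: false keyword needs Ruby 2.6 or later, and Integer is more lenient than a strict digit check (for instance it also accepts a 0x prefix and underscores between digits).

```ruby
# Hypothetical helper: returns the integer value, or nil if the string
# is not something Integer() accepts as base 16.
def parse_hex(str)
  Integer(str, 16, exception: false)
end

p parse_hex("0A")   # prints 10
p parse_hex("0Z")   # prints nil
p parse_hex("asfd") # prints nil
```

Because the whole string has to parse, gibberish such as "asfd" is rejected rather than being read as 10 the way String#hex does.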

Convert matched string of UTF-8 values to UTF-8 characters in Ruby

Trying to convert output from a rest_client GET to the characters that are represented with escape sequences.
Input: ..."sub_id":"\u0d9c\u8138\u8134\u3f30\u8139\u2b71"...
(which I put in 'all_subs')
Match: m = /sub_id\"\:\"([^\"]+)\"/.match(all_subs.to_str) [1]
Print: puts m.force_encoding("UTF-8").unpack('U*').pack('U*')
But it just comes out the same way I put it in. ie, "\u0d9c\u8138\u8134\u3f30\u8139\u2b71"
However, if I convert a raw string of it:
puts "\u0d9c\u8138\u8134\u3f30\u8139\u2b71".unpack('U*').pack('U*')
The output is perfect as "ග脸脴㼰脹⭱"
What you're getting when you parse the input string is actually this:
m = "\\u0d9c\\u8138\\u8134\\u3f30\\u8139\\u2b71"
Which is not the same as:
"\u0d9c\u8138\u8134\u3f30\u8139\u2b71"
Therefore one option is to eval the string so that ruby applies the codepoints:
puts eval("\"#{m}\"")
=> ග脸脴㼰脹⭱
However note that there are security implications when running eval.
If the string is always like the one in your example, you could also do something like this, which is safe:
puts m.split("\\u")[1..-1].map { |c| c.to_i(16) }.pack("U*")
=> ග脸脴㼰脹⭱
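A variation that also copes with ordinary text mixed in between the escapes is to substitute each escape in place. This is a minimal sketch: the method name decode_unicode_escapes is made up, and it assumes every escape is exactly four hex digits and does not combine surrogate pairs.

```ruby
# Hypothetical helper: turn literal \uXXXX sequences into the characters they name.
def decode_unicode_escapes(str)
  # Each match captures four hex digits; pack("U") encodes that code point as UTF-8.
  str.gsub(/\\u(\h{4})/) { [Regexp.last_match(1).hex].pack("U") }
end

# Single quotes keep the backslashes literal, like the matched text in the question.
m = '\u0d9c\u8138\u8134\u3f30\u8139\u2b71'
puts decode_unicode_escapes(m) # prints ග脸脴㼰脹⭱
```

Since the input looks like JSON, parsing the whole response with a JSON parser would decode these escapes as well.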

How can I convert a string of codepoints to the string it represents?

I have a string (in Ruby) like this:
626c6168
(that is 'blah' without the quotes)
How do I convert it to 'blah'? Note that these are variable lengths, and also they aren't always letters and numbers. (They're being stored in a database, not being printed.)
Array#pack
['626c6168'].pack('H*')
# => "blah"
Using hex to convert each character:
"626c6168".scan(/../).map{ |c| c.hex.chr }.join
This gives blah.
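Since the values are not always plain letters and numbers, keep in mind that pack('H*') returns a binary (ASCII-8BIT) string. The sketch below tags the result as UTF-8 and checks it; the method name hex_to_utf8 is invented, and UTF-8 is only an assumption about how the stored bytes were originally encoded.

```ruby
# Hypothetical helper: decode hex into bytes and interpret them as UTF-8.
# Returns nil if the bytes are not valid UTF-8.
def hex_to_utf8(hex)
  text = [hex].pack("H*").force_encoding("UTF-8")
  text.valid_encoding? ? text : nil
end

puts hex_to_utf8("626c6168") # prints blah
puts hex_to_utf8("d093")     # prints Г
```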
