Convert unicode mess to correct characters in Ruby?

I have a string such as:
"MÃ\u0083¼LLER".encoding
#<Encoding:UTF-8>
"MÃ\u0083¼LLER".inspect
"\"MÃ\\u0083¼LLER\""
What can I do to salvage such a string? Take into consideration I do not have the original data. Is this salvageable?

Looks like the string was mis-decoded as Latin-1 and re-encoded to UTF-8 twice (double mojibake). Try this on some of your data and let me know if it works:
require 'iconv'

def decode(str)
  i = Iconv.new('LATIN1', 'UTF-8')
  i.iconv(i.iconv(str)).force_encoding('UTF-8')
end
decode("MÃ\u0083¼LLER")
#=> "MüLLER"
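Note for modern Rubies: Iconv was removed from the standard library in Ruby 2.0, but the same two-pass repair can be sketched with String#encode. The sample below is a fully double-encoded "MüLLER" (the question's paste appears to have dropped the Â byte):

```ruby
# Each pass undoes one layer of mojibake: map the codepoints back to
# Latin-1 bytes, then reinterpret those bytes as UTF-8.
def decode(str)
  2.times { str = str.encode('ISO-8859-1').force_encoding('UTF-8') }
  str
end

decode("M\u00C3\u0083\u00C2\u00BCLLER")  # => "MüLLER"
```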

Related

String replace with strange characters when attributed to a HASH

I'm testing 2 situations and getting 2 strangely different results.
First:
hash_data_file = CSV.parse(data_file).map { |line|
  puts line[6]
  abort
}
The return is Caixa Econômica Federal with accents in the right place.
Second:
hash_data_file = CSV.parse(data_file).map { |line|
  puts :bank => line[6]
  abort
}
But the return is {:bank=>"Caixa Econ\xC3\xB4mica Federal"}, a string with errors in the codification instead of the accents.
What am I doing wrong?
In the first case, your data_file is in UTF-8 encoding. In the second case, data_file has binary (ASCII-8BIT, i.e. raw bytes) encoding.
For example, if we start with a simple UTF-8 CSV file:
bank
Caixa Econômica Federal
and then parse it with UTF-8 encoding:
CSV.parse(File.open('pancakes.csv', encoding: 'utf-8'))
# [["bank"], ["Caixa Econômica Federal"]]
and then in binary encoding:
CSV.parse(File.open('pancakes.csv', encoding: 'binary'))
# [["bank"], ["Caixa Econ\xC3\xB4mica Federal"]]
So you need to fix the encoding by reading the file in the proper encoding. Hard to say more since we don't know how data_file is being opened.
Have a look at
line[6].encoding
and you should see #<Encoding:UTF-8> in the first case but #<Encoding:ASCII-8BIT> in the second.
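If the file has already been read, the bytes themselves are intact valid UTF-8, so retagging the string is enough. A minimal sketch:

```ruby
s = "Caixa Econ\xC3\xB4mica Federal".b  # the bytes as CSV returned them (ASCII-8BIT)
fixed = s.force_encoding('UTF-8')       # retag only; no bytes change
fixed == "Caixa Econômica Federal"      # => true
```

Reading the file with the right encoding in the first place is still the cleaner fix.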
There is no “error in codification.”
"Caixa Econ\xC3\xB4mica Federal" == "Caixa Econômica Federal"
#⇒ true
When printing a hash, Ruby calls inspect on the values, and String#inspect shows the bytes of a non-UTF-8 (e.g. ASCII-8BIT) string as escapes; in a nutshell, the string you see is intact.

Ruby UTF-8 string to UCS-2 conversion

I have a UTF-8 string in my Ruby code. Due to limitations I want to convert the UTF-8 characters in that string to either their escaped equivalents (such as \u23) or simply convert the whole string to UCS-2. I need to do this explicitly to export the data to a file.
I tried to do the following in IRB:
my_string = '7.0mΩ'
my_string.encoding
my_string.encode!(Encoding::UCS_2BE)
my_string.encoding
The output of that is:
=> "7.0mΩ"
=> #<Encoding::UTF-8>
=> "7.0m\u2126"
=> #<Encoding::UTF-16BE>
This seemed to work fine (I got "ohm" as 2126) until I was reading data out of an array (in Rails):
data.each_with_index do |entry, idx|
  puts "#{idx} !! #{entry['title']} !! #{entry['value']} !! #{entry['value'].encode!(Encoding::UCS_2BE)}"
end
That results in the error:
incompatible character encodings: UTF-8 and UTF-16BE
I then tried to write a basic file conversion routine:
File.open(target, 'w', encoding: Encoding::UCS_2BE) do |file|
  File.open(source, 'r', encoding: Encoding::UTF_8).each_line do |line|
    file.puts(line)
  end
end
This resulted in all kinds of weird characters in the file.
Not sure what is going wrong.
Is there a better way to approach this problem of converting UTF-8 data to UCS-2 in Ruby? I really wouldn't mind this actually being changed in the string to \u2126 as a literal part of the string rather than the actual value.
Help!
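For what it's worth, the file-conversion attempt above can be sketched so that each line is transcoded explicitly before being written in binary mode, which avoids ever mixing UTF-8 and UTF-16BE data (file names below are placeholders):

```ruby
File.write('in.txt', "7.0m\u2126\n")            # sample UTF-8 input
File.open('out.txt', 'wb') do |out|             # binary mode: write raw bytes
  File.foreach('in.txt', encoding: 'UTF-8') do |line|
    out.write(line.encode(Encoding::UTF_16BE))  # transcode per line
  end
end
```

Note that many UCS-2/UTF-16 consumers expect a byte-order mark; if so, write `"\uFEFF".encode(Encoding::UTF_16BE)` first.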
Temporary Workaround
I monkey-patched this to do what I want. It's not very elegant, but it does the job (and yes, I know it's not pretty... it's just a hack to get what I need):
class String
  def hacky_encode
    encoded = self
    unless encoded.ascii_only?
      encoded = scan(/./).map do |char|
        char.ascii_only? ? char : char.unpack('U*').map { |i| '\\u' + i.to_s(16).rjust(4, '0') }
      end.join
    end
    encoded
  end
end
Which can be used:
"7.0mΩ".hacky_encode
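The same escaping idea can be written more compactly with gsub (the helper name below is made up; \u2126 is the ohm sign from the question). Codepoints above U+FFFF would still need surrogate handling for a true UCS-2 target:

```ruby
# Replace every non-ASCII character with a literal \uXXXX escape.
def escape_non_ascii(str)
  str.gsub(/[^[:ascii:]]/) { |ch| format('\u%04x', ch.ord) }
end

escape_non_ascii("7.0m\u2126") == '7.0m\u2126'  # => true (literal backslash-u)
```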

Percent encoding in Ruby

In Ruby, I get the percent-encoding of 'ä' by
require 'cgi'
CGI.escape('ä')
=> "%C3%A4"
The same with
'ä'.unpack('H2' * 'ä'.bytesize)
=> ["c3", "a4"]
I have two questions:
What is the reverse of the first operation? Shouldn't it be
["c3", "a4"].pack('H2' * 'ä'.bytesize)
=> "\xC3\xA4"
For my application I need 'ä' to be encoded as "%E4" which is the hex-value of 'ä'.ord. Is there any Ruby-method for it?
As I mentioned in my comment, treating the character ä as the codepoint 228 (0xE4) implies that you're dealing with the ISO 8859-1 character encoding.
So, you need to tell Ruby what encoding you want for your string.
str1 = "Hullo ängstrom" # uses whatever encoding is current, generally utf-8
str2 = str1.encode('iso-8859-1')
Then you can encode it as you like:
require 'cgi'
s2c = CGI.escape str2
#=> "Hullo+%E4ngstrom"
require 'uri'
s2u = URI.escape str2
#=> "Hullo%20%E4ngstrom"
Then, to reverse it, you must first (a) unescape the value, and then (b) turn the encoding back into what you're used to (likely UTF-8), telling Ruby what character encoding it should interpret the codepoints as:
s3a = CGI.unescape(s2c) #=> "Hullo \xE4ngstrom"
puts s3a.encode('utf-8','iso-8859-1')
#=> "Hullo ängstrom"
s3b = URI.unescape(s2u) #=> "Hullo \xE4ngstrom"
puts s3b.encode('utf-8','iso-8859-1')
#=> "Hullo ängstrom"
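A caveat for current Rubies: URI.escape/URI.unescape were deprecated for years and removed in Ruby 3.0. CGI.escape/CGI.unescape (or ERB::Util.url_encode for %20-style spaces) cover the same ground, e.g.:

```ruby
require 'cgi'

str = 'Hullo ängstrom'.encode('ISO-8859-1')
esc = CGI.escape(str)                           # => "Hullo+%E4ngstrom"
CGI.unescape(esc).encode('UTF-8', 'ISO-8859-1') # => "Hullo ängstrom"
```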

Convert matched string of UTF-8 values to UTF-8 characters in Ruby

Trying to convert output from a rest_client GET to the characters that are represented with escape sequences.
Input: ..."sub_id":"\u0d9c\u8138\u8134\u3f30\u8139\u2b71"...
(which I put in 'all_subs')
Match: m = /sub_id\"\:\"([^\"]+)\"/.match(all_subs.to_str)[1]
Print: puts m.force_encoding("UTF-8").unpack('U*').pack('U*')
But it just comes out the same way I put it in. ie, "\u0d9c\u8138\u8134\u3f30\u8139\u2b71"
However, if I convert a raw string of it:
puts "\u0d9c\u8138\u8134\u3f30\u8139\u2b71".unpack('U*').pack('U*')
The output is perfect as "ග脸脴㼰脹⭱"
What you're getting when you parse the input string is actually this:
m = "\\u0d9c\\u8138\\u8134\\u3f30\\u8139\\u2b71"
Which is not the same as:
"\u0d9c\u8138\u8134\u3f30\u8139\u2b71"
Therefore one option is to eval the string so that ruby applies the codepoints:
puts eval("\"#{m}\"")
=> ග脸脴㼰脹⭱
However note that there are security implications when running eval.
If the string is always like in your example. You could also do something like this, which is safe:
puts m.split("\\u")[1..-1].map { |c| c.to_i(16) }.pack("U*")
=> ග脸脴㼰脹⭱
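Since the data comes from a REST call, it is presumably JSON, and a JSON parser decodes the \uXXXX escapes for you without eval or regexes (minimal sketch of the payload, single-quoted so the escapes stay literal):

```ruby
require 'json'

all_subs = '{"sub_id":"\u0d9c\u8138\u8134\u3f30\u8139\u2b71"}'
JSON.parse(all_subs)['sub_id']  # => "ග脸脴㼰脹⭱"
```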

How to convert these kind of characters to their corresponding unicode characters in Ruby?

I want to know how to convert these kind of characters to their unicode form, such as the following one:
Delphi_7.0%E6%95%B0%E6%8D%AE%E5%BA%93%E5%BC%80%E5%8F%91%E5%85%A5%E9%97%A8%E4%B8%8E%E8%8C%83%E4%BE%8B%E8%A7%A3%E6%9E%90
The unicode characters of the upper string is:
Delphi_7.0数据库开发入门与范例解析
Anybody knows how to do the conversion using Ruby? Thanks.
This is a URI-encoded (percent-encoded) string:
require 'uri'
#=> true
s = 'Delphi_7.0%E6%95%B0%E6%8D%AE%E5%BA%93%E5%BC%80%E5%8F%91%E5%85%A5%E9%97%A8%E4%B8%8E%E8%8C%83%E4%BE%8B%E8%A7%A3%E6%9E%90'
#=> "Delphi_7.0%E6%95%B0%E6%8D%AE%E5%BA%93%E5%BC%80%E5%8F%91%E5%85%A5%E9%97%A8%E4%B8%8E%E8%8C%83%E4%BE%8B%E8%A7%A3%E6%9E%90"
URI.decode s
#=> "Delphi_7.0数据库开发入门与范例解析"
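One caveat for current Rubies: URI.decode (alias URI.unescape) was removed in Ruby 3.0. CGI.unescape or URI.decode_www_form_component do the same job here (both also turn + into a space, which this string does not contain):

```ruby
require 'cgi'

s = 'Delphi_7.0%E6%95%B0%E6%8D%AE%E5%BA%93%E5%BC%80%E5%8F%91%E5%85%A5%E9%97%A8%E4%B8%8E%E8%8C%83%E4%BE%8B%E8%A7%A3%E6%9E%90'
CGI.unescape(s)  # => "Delphi_7.0数据库开发入门与范例解析"
```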
