How to convert a string to UTF8 in Ruby - ruby

I'm writing a crawler which uses Hpricot. It downloads a list of strings from some webpage, and then I try to write them to a file. Something is wrong with the encoding:
"\xC3" from ASCII-8BIT to UTF-8
I have items which are rendered on a webpage and printed this way:
Développement
str.encoding returns UTF-8, so force_encoding('UTF-8') doesn't help. How can I convert this to readable UTF-8?

Your string seems to have been encoded the wrong way round:
"Développement".encode("iso-8859-1").force_encoding("utf-8")
#=> "Développement"

Seems your string thinks it is UTF-8, but in reality, it is something else, probably ISO-8859-1.
Define (force) the correct encoding first, then convert it to UTF-8.
In your example:
puts "Développement".encode('iso-8859-1').force_encoding('utf-8')
An alternative is:
puts "\xC3".force_encoding('iso-8859-1').encode('utf-8') #-> Ã
If the à makes no sense, then try another encoding.

"ruby 1.9: invalid byte sequence in UTF-8" described another good approach with less code:
file_contents.encode!('UTF-16', 'UTF-8')
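That one-liner only gets you to UTF-16; the full trick round-trips back to UTF-8, dropping or replacing the bytes that were invalid along the way (a sketch with an illustrative string):

```ruby
# Round-trip through UTF-16 to strip invalid UTF-8 bytes; a direct
# UTF-8 -> UTF-8 encode would be a no-op and leave the bad bytes in.
file_contents = "valid \xC3 invalid".dup   # a lone \xC3 is invalid UTF-8
file_contents.encode!("UTF-16", "UTF-8", invalid: :replace, replace: "")
file_contents.encode!("UTF-8", "UTF-16")
puts file_contents  # => "valid  invalid"
```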

Related

Parsing UTF-8 with Faraday

I'm making an API request with Faraday in Ruby and I'm parsing it with JSON.parse. The problem is that, the JSON response has sentences such as Longitud de la estaci\u00F3n meteorol\u00F3gica (grados) but it should be Longitud de la estación meteorológica (grados).
Is there a way to properly parse this?
I have connection = Faraday.new(my_site) and if I do connection.get.body.encoding, then I get #<Encoding:ASCII-8BIT>, but when I try connection.get.body.force_encoding('ASCII-8BIT').force_encoding('UTF-8') or connection.get.body.force_encoding('ASCII-8BIT').encode('UTF-8') I get 'encode': "\xF3" from ASCII-8BIT to UTF-8 (Encoding::UndefinedConversionError).
Thanks a lot in advance!
Try this:
connection.get.body.force_encoding('ISO-8859-1').encode('UTF-8')
I don't know about Faraday, but judging from the Encoding::UndefinedConversionError, the body is probably ISO-8859-1. I am assuming connection.get.body returns a normal String instance or its equivalent.
Background
As the official document (Ver.2.5.1) states, you should not try to convert ASCII-8BIT to any other encodings:
Encoding::ASCII_8BIT is a special encoding that is usually used for a byte string, not a character string.
The so-called extended ASCII, which adds accented letters and some extra punctuation to plain ASCII, is usually ISO-8859-1, though other encodings exist. The codepoint of ó (o with an acute accent) is indeed \xF3 in ISO-8859-1. Here is a code snippet to demonstrate it:
"\xf3".force_encoding('ISO-8859-1').encode('UTF-8')
# => "ó"
"\xf3".force_encoding('ASCII-8BIT').encode('UTF-8')
# => Encoding::UndefinedConversionError
This past answer explains it in a bit more detail.

I just want to write a simple µ in Ruby and Prawn

My head is about to explode...
How can I encode my string to UTF-8?
I always get this error:
Arguments to text methods must be UTF-8 encoded
I use Prawn as PDF Writer.
Put
# encoding: utf-8
on the first line of your .rb file.
You could use force_encoding:
"some string".force_encoding("UTF-8")
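A small sketch (without Prawn itself) to check that a µ string really is valid UTF-8 before handing it to Prawn's text method:

```ruby
# µ is \xC2\xB5 in UTF-8; a string that arrived tagged ASCII-8BIT only
# needs to be relabeled, since its bytes are already valid UTF-8.
s = "5 \xC2\xB5m".dup.force_encoding("UTF-8")
puts s                  # => "5 µm"
puts s.valid_encoding?  # => true
```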

Ruby, pack encoding (ASCII-8BIT that cannot be converted to UTF-8)

puts "C3A9".lines.to_a.pack('H*').encoding
results in
ASCII-8BIT
but I prefer this text in UTF-8. But
"C3A9".lines.to_a.pack('H*').encode("UTF-8")
results in
`encode': "\xC3" from ASCII-8BIT to UTF-8 (Encoding::UndefinedConversionError)
why? How can I convert it to UTF-8?
You're going about this the wrong way. If you have URI encoded data like this:
%C5%BBaba
Then you should use URI.unescape to decode it:
1.9.2-head :004 > URI.unescape('%C5%BBaba')
=> "Żaba"
If that doesn't work then force the encoding to UTF-8:
1.9.2-head :004 > URI.unescape('%C5%BBaba').force_encoding('utf-8')
=> "Żaba"
ASCII-8BIT is a pretend encoding native to Ruby. It is aliased as BINARY, and it is just that: ASCII-8BIT is not a character encoding, but rather a way of saying that a string is binary data and should not be processed like text. Because pack/unpack functions are designed to operate on binary data, you should never assume that what is returned is printable under any encoding unless the ENTIRE pack string is made up of character directives. If you clarify what the overall goal is, maybe we can give you a better solution.
If you isolate a character's hex code into a variable, say code (the hexadecimal string minus the percent sign), you can pack its codepoint as a UTF-8 character:
utf_char = [code.to_i(16)].pack("U")
Combining these with the rest of the string, you can rebuild your string.
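For the pack question above, the \xC3\xA9 bytes are already valid UTF-8 (they spell "é"); they only need relabeling, not transcoding (sketch):

```ruby
# pack('H*') turns the hex string into raw bytes tagged ASCII-8BIT;
# force_encoding relabels them as UTF-8 instead of transcoding.
bytes = ["C3A9"].pack("H*")           # "\xC3\xA9", ASCII-8BIT
utf8  = bytes.force_encoding("UTF-8")
puts utf8  # => "é"
```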

Convert non-ASCII chars from ASCII-8BIT to UTF-8

I'm pulling text from remote sites and trying to load it into a Ruby 1.9/Rails 3 app that uses utf-8 by default.
Here is an example of some offending text:
Cancer Res; 71(3); 1-11. ©2011 AACR.\n
That Copyright code expanded looks like this:
Cancer Res; 71(3); 1-11. \xC2\xA92011 AACR.\n
Ruby tells me that string is encoded as ASCII-8BIT and feeding into my Rails app gets me this:
incompatible character encodings: ASCII-8BIT and UTF-8
I can strip the copyright code out using this regex
str.gsub(/[^\x00-\x7F]/n,'?')
to produce this
Cancer Res; 71(3); 1-11. ??2011 AACR.\n
But how can I get a copyright symbol (and various other symbols such as greek letters) converted into the same symbols in UTF-8? Surely it is possible...
I see references to using force_encoding but this does not work:
str.force_encoding('utf-8').encode
I realize there are many other people with similar issues but I've yet to see a solution that works.
This works for me:
#encoding: ASCII-8BIT
str = "\xC2\xA92011 AACR"
p str, str.encoding
#=> "\xC2\xA92011 AACR"
#=> #<Encoding:ASCII-8BIT>
str.force_encoding('UTF-8')
p str, str.encoding
#=> "©2011 AACR"
#=> #<Encoding:UTF-8>
There are two possibilities:
The input data is already UTF-8, but Ruby just doesn't know it. That seems to be your case, as "\xC2\xA9" is valid UTF-8 for the copyright symbol. In which case you just need to tell Ruby that the data is already UTF-8 using force_encoding.
For example "\xC2\xA9".force_encoding('ASCII-8BIT') would recreate the relevant bit of your input data. And "\xC2\xA9".force_encoding('ASCII-8BIT').force_encoding('UTF-8') would demonstrate that you can tell Ruby that it is really UTF-8 and get the desired result.
The input data is in some other encoding and you need Ruby to transcode it to UTF-8. In that case you'd have to tell Ruby what the current encoding is (ASCII-8BIT is ruby-speak for binary, it isn't a real encoding), then tell Ruby to transcode it.
For example, say your input data was ISO-8859-1. In that encoding the copyright symbol is just "\xA9". This would generate such a bit of data: "\xA9".force_encoding('ISO-8859-1') And this would demonstrate that you can get Ruby to transcode that to UTF-8: "\xA9".force_encoding('ISO-8859-1').encode('UTF-8')
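Both possibilities side by side (a runnable sketch of the snippets above):

```ruby
# Case 1: the bytes are already UTF-8, just mislabeled -> relabel.
already_utf8 = "\xC2\xA9".dup.force_encoding("ASCII-8BIT")
puts already_utf8.force_encoding("UTF-8")   # => "©"

# Case 2: the bytes are ISO-8859-1 -> label correctly, then transcode.
latin1 = "\xA9".dup.force_encoding("ISO-8859-1")
puts latin1.encode("UTF-8")                 # => "©"
```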
I used to do this for a script that scraped Greek Windows-encoded pages, using open-uri, iconv and Hpricot:
doc = open(DATA_URL)
doc.rewind
data = Hpricot(Iconv.conv('utf-8', "WINDOWS-1253", doc.readlines.join("\n")))
I believe that was Ruby 1.8.7; I'm not sure how things are with Ruby 1.9.
I've been having issues with character encoding, and the other answers have been helpful, but didn't work for every case. Here's the solution I came up with: it forces the encoding when possible and replaces invalid characters with '?' when not:
def encode(str)
  encoded = str.force_encoding('UTF-8')
  unless encoded.valid_encoding?
    encoded = str.encode('UTF-8', invalid: :replace, undef: :replace, replace: '?')
  end
  encoded
end
force_encoding works most of the time, but I've encountered some strings where it fails. Strings like this will have invalid characters replaced:
str = "don't panic: \xD3"
str.valid_encoding?
# => false
str = str.encode("utf-8", invalid: :replace, undef: :replace, replace: '?')
# => "don't panic: ?"
str.valid_encoding?
# => true
Update: I have had some issues in production with the above code. I recommend that you set up unit tests with known problem text to make sure that this code works for you like you need it to. Once I come up with version 2 I'll update this answer.
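On Ruby 2.1+, String#scrub does the invalid-byte replacement in a single call (an alternative to the helper above, not from the original answer):

```ruby
# String#scrub replaces invalid byte sequences with the given string.
str = "don't panic: \xD3"
puts str.scrub("?")  # => "don't panic: ?"
```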

display iso-8859-1 encoded data gives strange characters

I have a ISO-8859-1 encoded csv-file that I try to open and parse with ruby:
require 'csv'
filename = File.expand_path('~/myfile.csv')
file = File.open(filename, "r:ISO-8859-1")
CSV.parse(file.read, col_sep: "\t") do |row|
  puts row
end
If I leave out the encoding from the call to File.open, I get an error
ArgumentError: invalid byte sequence in UTF-8
My problem is that the call to puts row displays strange characters instead of the norwegian characters æ,ø,å:
BOKF�RINGSDATO
I get the same if I open the file in textmate, forcing it to use UTF-8 encoding.
By assigning the file content to a string, I can check the encoding used for the string. As expected, it shows ISO-8859-1.
So when I puts each row, why does it output the string as UTF-8?
Is it something to do with the csv-library?
I use ruby 1.9.2.
Found myself an answer by trying different things from the documentation:
require 'csv'
filename = File.expand_path('~/myfile.csv')
File.open(filename, "r:ISO-8859-1") do |file|
  CSV.parse(file.read.encode("UTF-8"), col_sep: "\t") do |row|
    # ↳ returns a copy transcoded to UTF-8.
    puts row
  end
end
As you can see, all I have done is encode the string to a UTF-8 string before the CSV parser gets it.
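Ruby can also transcode for you at read time if you give both an external and an internal encoding in the open mode (an alternative to the explicit encode call; the sample data here is illustrative):

```ruby
require 'csv'
require 'tempfile'

# Write a small ISO-8859-1 file (Ø is \xD8 in ISO-8859-1), then open it
# with "r:ISO-8859-1:UTF-8": read as ISO-8859-1, handed back as UTF-8.
Tempfile.create('myfile.csv') do |f|
  f.binmode
  f.write("BOKF\xD8RINGSDATO\tBEL\xD8P\n")
  f.rewind
  File.open(f.path, "r:ISO-8859-1:UTF-8") do |file|
    CSV.parse(file.read, col_sep: "\t") do |row|
      puts row  # row strings are already UTF-8 here
    end
  end
end
```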
Edit:
Trying this solution on macruby-head, I get the following error message from encode:
Encoding::InvalidByteSequenceError: "\xD8" on UTF-8
Even though I specify the encoding when opening the file, macruby uses UTF-8.
This seems to be a known macruby limitation: encoding is always UTF-8.
Maybe you could use Iconv to convert the file contents to UTF-8 before parsing?
ISO-8859-1 and Win-1252 are really close in their character sets. Could some app have processed the file and converted it? Or could it have been received from a machine that was defaulting to Win-1252, which is Windows' standard setting?
Software that senses the code set can get the encoding wrong if there are no characters in the 0x80 to 0x9F byte range, so you might try changing file = File.open(filename, "r:ISO-8859-1") to file = File.open(filename, "r:Windows-1252"). (I think "Windows-1252" is the right encoding name.)
I used to write spiders, and HTML is notorious for being mis-labeled or for having encoded binary characters from one character set embedded in another. I used some bad language many times over these problems several years ago, before most languages had implemented UTF-8 and Unicode so I understand the frustration.
ISO/IEC_8859-1,
Windows-1252
