Parsing UTF-8 with Faraday - ruby

I'm making an API request with Faraday in Ruby and I'm parsing the response with JSON.parse. The problem is that the JSON response has sentences such as Longitud de la estaci\u00F3n meteorol\u00F3gica (grados) when it should be Longitud de la estación meteorológica (grados).
Is there a way to properly parse this?
I have connection = Faraday.new(my_site), and if I do connection.get.body.encoding I get #<Encoding:ASCII-8BIT>. But when I try connection.get.body.force_encoding('ASCII-8BIT').force_encoding('UTF-8') or connection.get.body.force_encoding('ASCII-8BIT').encode('UTF-8'), I get `encode': "\xF3" from ASCII-8BIT to UTF-8 (Encoding::UndefinedConversionError).
Thanks a lot in advance!

Try this:
connection.get.body.force_encoding('ISO-8859-1').encode('UTF-8')
I don't know about Faraday, but judging from the Encoding::UndefinedConversionError, the body is probably ISO-8859-1 text that has been tagged as ASCII-8BIT. I am assuming connection.get.body returns a normal String instance or its equivalent.
Background
As the official documentation (ver. 2.5.1) states, you should not try to convert ASCII-8BIT to any other encoding:
Encoding::ASCII_8BIT is a special encoding that is usually used for a byte string, not a character string.
So-called extended ASCII, which adds accented letters and some extra punctuation to plain ASCII, is usually ISO-8859-1, though other encodings exist. Certainly the codepoint of ó (o with an acute accent) is \xF3 in ISO-8859-1. Here is a code snippet to demonstrate it:
"\xf3".force_encoding('ISO-8859-1').encode('UTF-8')
# => "ó"
"\xf3".force_encoding('ASCII-8BIT').encode('UTF-8')
# => Encoding::UndefinedConversionError
This past answer explains it in a bit more detail.
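Putting the whole thing together with Faraday, something like this should work (a sketch only: the URL is a placeholder for my_site, and it assumes the body really is ISO-8859-1 bytes, as the \xF3 suggests):
require 'faraday'
require 'json'

connection = Faraday.new('https://example.com')  # placeholder for my_site
body = connection.get.body                       # arrives tagged ASCII-8BIT
utf8 = body.force_encoding('ISO-8859-1').encode('UTF-8')
data = JSON.parse(utf8)                          # JSON.parse also decodes the \u00F3 escapes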

Related

How to keep all characters converting from UTF-8 to CP1252 on ruby 2.2
This code:
file = 'd:/1 descrição.txt'
puts file.encode('cp1252')
Gives this error:
`encode': U+0327 to WINDOWS-1252 in conversion from UTF-8 to WINDOWS-1252 (Encoding::UndefinedConversionError)
My application needs to be cp1252, but I can't find any way to keep all the characters.
I can't simply replace these characters, because later I will use this info to read the file from the file system.
puts file.encode('cp1252', undef: :replace, replace: '')
> d:/1 descricao.txt
PS: It is a Ruby script, not a Ruby on Rails application.
UTF-8 covers the entire range of Unicode, but CP1252 includes only a small subset of it. Obviously this means that there are characters that can be encoded in UTF-8 but not in CP1252. This is the problem you are facing.
In your example it looks like the string contains only characters that should work in CP1252, but clearly it doesn't.
The character in the error message, U+0327 is a combining character, and is not representable in CP1252. It combines with the preceding c to produce ç. ç can also be represented as a single character (U+00E7), which is representable in CP1252.
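You can see the difference for yourself; the \u escapes below spell out the two forms explicitly:
decomposed  = "c\u0327"   # "c" followed by combining cedilla U+0327
precomposed = "\u00E7"    # "ç" as the single character U+00E7
decomposed == precomposed                          # => false, different codepoints
decomposed.unicode_normalize(:nfc) == precomposed  # => true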
One option might be normalisation, which will convert the string into a form that is representable in CP1252.
file = 'd:/1 descrição.txt'.unicode_normalize(:nfc)
puts file.encode('cp1252')
(It appears that Stack Overflow is normalizing the string when displaying your question, which is probably why copying the code from the question and running it doesn’t produce any errors.)
This will avoid the error, but note that it is not necessarily possible to reverse the process to get the original string unless the original is in a known normalized form already.
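If you later need the path as UTF-8 again, the CP1252 string converts back cleanly, but you get the NFC form rather than whatever form the original was in:
cp1252_path = file.encode('cp1252')
cp1252_path.encode('utf-8')  # => "d:/1 descrição.txt", now in NFC form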

Ruby character encoding issue with scraped HTML

I'm having a character encoding issue with a Ruby script that does some HTML scraping and parsing with the Nokogiri gem. At one point in the script, I call join("\n") on an array of strings that have been pulled from some HTML, which causes this error:
./script.rb:333:in `join': incompatible character encodings: UTF-8 and ASCII-8BIT (Encoding::CompatibilityError)
In my logs, I can see CafÃ© showing up for some of the strings that would be included in the join operation.
Is it that some of the strings in my array to be joined are ASCII-8BIT and some are UTF-8, and Ruby can't combine them? Do I need to convert or sanitize my strings into UTF-8 after parsing them with Nokogiri?
I tried force_encoding('UTF-8') and encode('UTF-8') on the scraped HTML content before I do anything else with it, but it didn't help. In fact, after I tried encode('UTF-8'), my script crashed even earlier when it called to_s on a string containing CafÃ©.
Character encoding always really confuses me. Is there something else I can do to sanitize the strings to avoid this error?
Edit:
I was doing something similar in Perl recently, using a module called Text::Unidecode; I was able to pass my strings to a function that translates any problematic characters, e.g. the letter a with an acute to the plain letter a. Is there anything similar for Ruby? (This isn't necessarily what I'm aiming for though; if I can keep the a with acute, that's preferable, I think.)
Edit2:
I'm really confused by this and it's proving difficult to reproduce reliably. Here's some code:
[CODE REMOVED]
Edit3:
I removed the previously posted code example because it wasn't correct. But the bottom line is, whenever I try to print or call to_s on the string that was scraped, I get the encoding error.
Edit4:
It turned out in the end that the scraped html input was not what was causing the problem. I got the encoding error whenever I tried to print or call to_s on a hash containing, among other things, the scraped html text. The 'other things' were values from database queries, and they were being returned in ASCII-8BIT. To fix the issue, I explicitly had to call force_encoding('UTF-8') on each database value that I use (although I hear that the mysql2 gem does this automatically so I should switch to that).
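For anyone hitting the same thing, a rough sketch of that workaround (fetch_row is a made-up stand-in for however you query the database):
row = fetch_row  # hypothetical: a Hash of column name => value from the mysql gem
row.each_value do |value|
  # the old mysql gem returns strings tagged ASCII-8BIT; relabel them
  value.force_encoding('UTF-8') if value.is_a?(String)
end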
I hate character encoding.
Presumably, CafÃ© is supposed to be Café. If we start out with Café in UTF-8 but treat the bytes as though they were encoded in ISO-8859-1 (AKA Latin-1) and then re-encode them as UTF-8, we get the CafÃ© that you're seeing; for example:
> s = 'Café'
=> "Café"
> s.encoding
=> #<Encoding:UTF-8>
> s.force_encoding('iso-8859-1').encode('utf-8')
=> "Café"
So somewhere you're reading a UTF-8 string but treating it as Latin-1 and re-encoding it as UTF-8. I'd guess that Nokogiri is reading the page and thinking that it is Latin-1 or being told by your user agent that it is getting Latin-1 text. Perhaps you have a bad default encoding somewhere, or the HTTP headers are lying about the encoding, or the page itself is lying about its encoding.
You need to get everything into UTF-8 at the edges of your scraper. Figure out who is lying about the encoding and sort it out right there.
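If it turns out the page really is UTF-8 but mislabeled, you can tell Nokogiri the encoding explicitly rather than letting it guess (the URL is a placeholder):
require 'nokogiri'
require 'open-uri'

html = open('http://example.com') { |f| f.read }
doc  = Nokogiri::HTML(html, nil, 'UTF-8')  # third argument overrides the detected encoding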
Don't feel bad, scraping and encoding is a nightmare of confusion, stupidity, guesswork, and hard liquor. Servers lie, pages lie, browsers lie, no one is happy.

How to convert a string to UTF8 in Ruby

I'm writing a crawler which uses Hpricot. It downloads a list of strings from some webpage, then I try to write it to the file. Something is wrong with the encoding:
"\xC3" from ASCII-8BIT to UTF-8
I have items which are rendered on a webpage and printed this way:
DÃ©veloppement
str.encoding returns UTF-8, so force_encoding('UTF-8') doesn't help. How can I convert this to readable UTF-8?
Your string seems to have been encoded the wrong way round:
"Développement".encode("iso-8859-1").force_encoding("utf-8")
#=> "Développement"
Seems your string thinks it is UTF-8, but in reality, it is something else, probably ISO-8859-1.
Define (force) the correct encoding first, then convert it to UTF-8.
In your example:
puts "Développement".encode('iso-8859-1').encode('utf-8')
An alternative is:
puts "\xC3".force_encoding('iso-8859-1').encode('utf-8') #-> Ã
If the à makes no sense, then try another encoding.
"ruby 1.9: invalid byte sequence in UTF-8" described another good approach with less code:
file_contents.encode!('UTF-16', 'UTF-8')
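Note that this line leaves the string in UTF-16; the full idiom from that answer converts back again, and needs invalid: :replace to actually drop the bad bytes:
file_contents.encode!('UTF-16', 'UTF-8', invalid: :replace, replace: '')
file_contents.encode!('UTF-8', 'UTF-16')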

Ruby, pack encoding (ASCII-8BIT that cannot be converted to UTF-8)

puts "C3A9".lines.to_a.pack('H*').encoding
results in
ASCII-8BIT
but I prefer this text in UTF-8. But
"C3A9".lines.to_a.pack('H*').encode("UTF-8")
results in
`encode': "\xC3" from ASCII-8BIT to UTF-8 (Encoding::UndefinedConversionError)
why? How can I convert it to UTF-8?
You're going about this the wrong way. If you have URI encoded data like this:
%C5%BBaba
Then you should use URI.unescape to decode it:
1.9.2-head :004 > URI.unescape('%C5%BBaba')
=> "Żaba"
If that doesn't work then force the encoding to UTF-8:
1.9.2-head :004 > URI.unescape('%C5%BBaba').force_encoding('utf-8')
=> "Żaba"
ASCII-8BIT is a pretend encoding native to Ruby. It has an alias, BINARY, and it is just that. ASCII-8BIT is not a character encoding, but rather a way of saying that a string is binary data and not to be processed like text. Because pack/unpack functions are designed to operate on binary data, you should never assume that what is returned is printable under any encoding unless the ENTIRE pack string is made up of character derivatives. If you clarify what the overall goal is, maybe we could give you a better solution.
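That said, in this particular case the packed bytes do form valid UTF-8, so the fix is to relabel them with force_encoding rather than transcode them with encode:
bytes = ['C3A9'].pack('H*')            # => "\xC3\xA9", tagged ASCII-8BIT
utf8  = bytes.force_encoding('UTF-8')  # => "é", relabelled without transcoding
utf8.valid_encoding?                   # => true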
If you isolate the hex of a character's codepoint into a variable, say code, as a string of hexadecimal digits without the percent signs:
utf_char=[code.to_i(16)].pack("U")
Combine these with the rest of the string, and you can build up your string.
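Keep in mind that pack('U') takes a Unicode codepoint, which is not the same hex you see in a URI escape (those are UTF-8 bytes):
[0xF3].pack('U')   # => "ó" (codepoint U+00F3)
[0x17B].pack('U')  # => "Ż" (codepoint U+017B, whose UTF-8 bytes URI-escape as %C5%BB)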

gsub :: ArgumentError (invalid byte sequence in UTF-8)

This code uses the Hpricot gem to get HTML that contains UTF-8 characters.
# <div>This is a test测试</div>
div[0].to_html.gsub(/test/, "")
When that is run, it spits out this error (pointing at gsub):
ArgumentError (invalid byte sequence in UTF-8)
How can we fix this issue?
Figured out the issue. Hpricot's to_html calls methods that trigger the error, so to get rid of it we need to make the whole Hpricot document UTF-8, not just that one string. We do that like this:
require 'iconv'
require 'open-uri'

ic = Iconv.new("UTF-8//IGNORE", "UTF-8")
doc = open("http://example.com") {|f| Hpricot(ic.iconv(f.read)) }
And then we can call other Hpricot methods but now the whole document has UTF-8 encoding and it won't give us any errors.
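(Iconv has since been removed from Ruby's standard library; on Ruby 2.1 or later, String#scrub does the same cleanup. A rough, untested equivalent:)
require 'open-uri'
require 'hpricot'

raw = open("http://example.com") { |f| f.read }
doc = Hpricot(raw.force_encoding('UTF-8').scrub(''))  # drop bytes that aren't valid UTF-8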
to_html looks to be returning a non-UTF-8 string in this case.
I had the same problem with a file containing some non-UTF-8 characters. The fix I found is not really beautiful, but it could also work for your case:
the_utf8_string = the_non_utf8_string.unpack('C*').pack('U*')
Be careful, though: I'm not sure that no data is lost.
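For what it's worth, unpack('C*').pack('U*') maps every byte to the Unicode codepoint with the same number, which is exactly what decoding the input as ISO-8859-1 means. String#encode expresses the same thing more directly:
the_utf8_string = the_non_utf8_string.force_encoding('ISO-8859-1').encode('UTF-8')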

Resources