Why does HAML throw an encoding error when ERB and Erubis don't? - ruby

Consider the following code:
__ENCODING__
# => #<Encoding:UTF-8>
Encoding.default_internal
# => #<Encoding:UTF-8>
Encoding.default_external
# => #<Encoding:UTF-8>
Case 1: HAML throws Encoding::UndefinedConversionError
string = "j\xC3\xBCrgen".force_encoding('ASCII-8BIT')
string.encoding
# => #<Encoding:ASCII-8BIT>
Haml::Engine.new("#{string}").render
## => Encoding::UndefinedConversionError: "\xC3" from ASCII-8BIT to UTF-8
ERB.new("<%= string %>").result(binding)
# => "jürgen"
# => Resulting encoding is #<Encoding:UTF-8>
Erubis::Eruby.new("<%= string %>").result(binding)
# => "j\xC3\xBCrgen"
# => resulting encoding is #<Encoding:ASCII-8BIT>
Case 2: HAML doesn't throw error
string = "Ratatouille".force_encoding('ASCII-8BIT')
string.encoding
# => #<Encoding:ASCII-8BIT>
Haml::Engine.new("#{string}").render
## => "Ratatouille\n"
## => resulting encoding is #<Encoding:UTF-8>
ERB.new("<%= string %>").result(binding)
# => "Ratatouille"
# => resulting encoding is #<Encoding:UTF-8>
Erubis::Eruby.new("<%= string %>").result(binding)
# => "Ratatouille"
# => result encoding is #<Encoding:US-ASCII>
Question:
Why is HAML failing in case 1 and succeeding in case 2?
I'm asking because I'm facing a similar problem when rendering in HAML, which blows up the page with an Encoding::CompatibilityError.
The only way I currently know to avoid the error is to force my string's encoding to UTF-8 with .force_encoding('UTF-8'), which sort of avoids the issue, but I have to do it on every page where I want to use the given string, i.e. "j\xC3\xBCrgen" (which feels kind of lame considering how many pages there are).
Any clue?

Haml is trying to encode the result string to your Encoding.default_internal setting. In the first example the string ("j\xC3\xBCrgen") contains non-ASCII bytes (i.e. bytes with the high bit set), whilst the string in the second example ("Ratatouille") doesn’t. Ruby can encode the second string (since UTF-8 is a superset of ASCII), but can’t encode the first and raises an error.
One way to work round this is to explicitly pass the string's encoding as an option to Haml::Engine:
Haml::Engine.new("#{string}", :encoding => Encoding::ASCII_8BIT).render
This will give you a result string that is also ASCII-8BIT.
In this case the string in question is UTF-8 though, so a better solution might be to look at where the string is coming from in your app and ensure it has the right encoding.
I don’t know enough about ERB and Erubis to say exactly what’s happening. It looks like ERB is incorrectly assuming the string is UTF-8 (it has no way to know those bytes should actually be treated as UTF-8), while Erubis is doing the more sensible thing of leaving the encoding as binary, either because it isn’t doing any encoding handling at all, or because it treats binary-encoded input specially.
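If the bytes really are UTF-8, as they are here, a less repetitive option is to relabel the string once, where it enters the application, rather than in every view. A minimal sketch (the raw/name variables are made up for illustration):
require 'haml'
raw  = "j\xC3\xBCrgen".force_encoding('ASCII-8BIT') # e.g. bytes handed back as binary by a driver
name = raw.force_encoding('UTF-8')                  # relabel once; the bytes themselves are unchanged
Haml::Engine.new("#{name}").render                  # => "jürgen\n", encoded as UTF-8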

From the PickAxe book:
Ruby supports a virtual encoding called ASCII-8BIT. Despite the ASCII in the name, this is really intended to be used on data streams that contain binary data (which is why it has an alias of BINARY). However, you can also use this as an encoding for source files. If you do, Ruby interprets all characters with codes below 128 as regular ASCII and all other characters as valid constituents of variable names. This is basically a neat hack, because it allows you to compile a file written in an encoding you don’t know; the characters with the high-order bit set will be assumed to be printable.
String#force_encoding tells Ruby which encoding to use in order to interpret some binary data. It does not change/convert the actual bytes (that would be String#encode), just changes the encoding associated with these bytes.
Why would you try to associate a BINARY encoding to a string containing UTF-8 characters anyway?
Regarding your question about why the second case succeeds: the answer is simply that your second string ("Ratatouille") contains only 7-bit ASCII characters.
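A quick illustration of the difference between the two methods (run in a UTF-8 environment like the one in the question):
s = "j\xC3\xBCrgen".force_encoding('ASCII-8BIT')
s.force_encoding('UTF-8')       # => "jürgen"    (same bytes, new label)
"jürgen".encode('ISO-8859-1')   # => "j\xFCrgen" (different bytes: an actual conversion)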

Related

Ruby URI.extract returns empty array or ArgumentError: invalid byte sequence in UTF-8

I'm trying to get a list of files from a URL like this:
require 'uri'
require 'open-uri'
url = 'http://www.wmprof.com/media/niti/download'
html = open(url).read
puts URI.extract(html).select{ |link| link[/(PL)/]}
This code raises ArgumentError: invalid byte sequence in UTF-8 on the line with URI.extract (even though html.encoding returns UTF-8).
I've found some solutions to encoding problems, but when I change the code to
html.encode('UTF-8', invalid: :replace, undef: :replace, replace: '?')
URI.extract returns an empty array, even when I'm not calling the select method on it. Any suggestions?
The character encoding of the website might be ISO-8859-1 or a related one. We can't tell for sure since there are only two occurrences of the same non-US-ASCII-character and it doesn't really matter anyway.
html.each_char.reject(&:ascii_only?) # => ["\xDC", "\xDC"]
Finding the actual encoding is done by guessing. The age of the page (HTML 3.2) or the language(s) used might be a clue. In this case the content of the PDF file is especially helpful (it contains SPRÜH-EX and the file has the name TI_DE_SPR%dcH_EX.pdf). Then we only need to find the encoding for which "\xDC" and "Ü" are equal, either by knowing it or by writing some Ruby:
Encoding.list.select { |e| "Ü" == "\xDC".encode!(Encoding::UTF_8, e) rescue next }.map(&:name)
Of course, letting a program do the guessing is an option too. There is the libguess library. The web browser can do it as well, although you need to download the file first, since the server may tell the browser it's UTF-8 even when it isn't (as it does in this case). Any decent text editor will also try to detect the file encoding: e.g. ST3 thinks it's Windows-1252, which is a superset of ISO-8859-1 (like UTF-8 is of US-ASCII).
Possible solutions are manually setting the string encoding to ISO-8859-1:
html.force_encoding(Encoding::ISO_8859_1)
Or (preferably) transcoding the string from ISO-8859-1 to UTF-8:
html.encode!(Encoding::UTF_8, Encoding::ISO_8859_1)
To answer the other question: URI.extract isn't the method you're looking for. Apparently it's obsolete and, more importantly, it doesn't extract relative URIs.
A simple alternative is a regular expression with String#scan. It works with this site, but it might not with other ones; for the best reliability you have to use an HTML parser (there may also be a gem for this, and see the sketch after the example). Here's an example that should do what you want:
html.scan(/href="(.*?PL.*?)"/).flatten # => ["SI_PL_ACTIV_bicompact.pdf", ...]
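For reference, a hedged sketch of the parser-based approach with Nokogiri (assuming the gem is available and that the interesting links all contain "PL" in their href, as above):
require 'nokogiri'
doc = Nokogiri::HTML(html.encode(Encoding::UTF_8, Encoding::ISO_8859_1))
puts doc.css('a[href]').map { |a| a['href'] }.grep(/PL/)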

Convert a unicode string to characters in Ruby?

I have the following string:
l\u0092issue
My question is how to convert it to utf8 characters ?
I have tried that
1.9.3p484 :024 > "l\u0092issue".encode('utf-8')
=> "l\u0092issue"
You seem to have got your encodings into a bit of a mix up. If you haven’t already, you should first read Joel Spolsky’s article The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) which provides a good introduction into this type of thing. There is a good set of articles on how Ruby handles character encodings at http://graysoftinc.com/character-encodings/understanding-m17n-multilingualization. You could also have a look at the Ruby docs for String and Encoding.
In this specific case, the string l\u0092issue means that the second character is the character with the unicode codepoint 0x92. This codepoint is PRIVATE USE TWO (see the chart), which basically means this position isn’t used.
However, looking at the Windows CP-1252 encoding, position 0x92 is occupied by the character ’, so if this is the missing character then the string would be l’issue, which looks a lot more likely even though I don’t speak French.
What I suspect has happened is your program has received the string l’issue encoded in CP-1252, but has assumed it was encoded in ISO-8859-1 (ISO-8859-1 and CP-1252 are quite closely related) and re-encoded it to UTF-8 leaving you with the string you now have.
The real fix for you is to be careful about the encodings of any strings that enter (and leave) your program, and how you manage them.
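That suspicion can be reproduced in a couple of lines (purely a demonstration of the mangling, not something to rely on):
"l’issue".encode('Windows-1252').force_encoding('ISO-8859-1').encode('UTF-8')
# => "l\u0092issue"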
To transform your string to l’issue, you can encode it back to ISO-8859-1, then use force_encoding to tell Ruby that the real encoding is CP-1252, and then you can re-encode to UTF-8:
2.1.0 :001 > s = "l\u0092issue"
=> "l\u0092issue"
2.1.0 :002 > s = s.encode('iso-8859-1')
=> "l\x92issue"
2.1.0 :003 > s.force_encoding('cp1252')
=> "l\x92issue"
2.1.0 :004 > s.encode('utf-8')
=> "l’issue"
This is only really a demonstration of what is going on though. The real solution is to make sure you’re handling encodings correctly.
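The same round trip can be compressed into one expression, just to show the chain of steps:
"l\u0092issue".encode('iso-8859-1').force_encoding('cp1252').encode('utf-8')
# => "l’issue"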
That string is already encoded as UTF-8 (unless you changed the original string's encoding). Ruby is just showing you the escape sequences when you inspect the string (which is what IRB is doing there); \u0092 is the escape sequence for this character.
Try puts "l\u0092issue" to see the rendered character, if your terminal font supports it.

Ruby, pack encoding (ASCII-8BIT that cannot be converted to UTF-8)

puts "C3A9".lines.to_a.pack('H*').encoding
results in
ASCII-8BIT
but I prefer this text in UTF-8. But
"C3A9".lines.to_a.pack('H*').encode("UTF-8")
results in
`encode': "\xC3" from ASCII-8BIT to UTF-8 (Encoding::UndefinedConversionError)
Why? How can I convert it to UTF-8?
You're going about this the wrong way. If you have URI encoded data like this:
%C5%BBaba
Then you should use URI.unescape to decode it:
1.9.2-head :004 > URI.unescape('%C5%BBaba')
=> "Żaba"
If that doesn't work then force the encoding to UTF-8:
1.9.2-head :004 > URI.unescape('%C5%BBaba').force_encoding('utf-8')
=> "Żaba"
ASCII-8BIT is a pretend encoding native to Ruby. It has an alias of BINARY, and it is just that: ASCII-8BIT is not a character encoding, but rather a way of saying that a string is binary data and not to be processed like text. Because pack/unpack functions are designed to operate on binary data, you should never assume that what is returned is printable under any encoding unless the ENTIRE pack string is made up of character directives. If you clarify what the overall goal is, maybe we could give you a better solution.
If you isolate a hex UTF-8 code into a variable, say code, which is a string of the hexadecimal value minus the percent sign:
utf_char = [code.to_i(16)].pack("U")
Combine this with the rest of the string and you can build your string.
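Applied to the string from the question, the hex digits C3A9 happen to spell out the two UTF-8 bytes of "é", so simply relabelling the packed bytes is enough (a sketch for this particular case, not a general rule):
["C3A9"].pack('H*').force_encoding('UTF-8')  # => "é"
[0xE9].pack('U')                             # => "é" (packing the Unicode code point directly)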

Encoding Unicode code points with Ruby

I'm retrieving an HTML document that is parsed with Nokogiri. The HTML is using charset ISO-8859-1. The problem is there are some Unicode chars in the document which are converted to Unicode code points instead of their respective character.
For example, this is some text in the HTML as received (in ISO-8859-1):
\x95\x95 JOHNNY VENETTI \x95\x95
And when attempting to work with this text, it gets converted to this:
\u0095\u0095 JOHNNY VENETTI \u0095\u0095
So my question is, how can I ensure those characters are represented as their appropriate character instead of the code point? I've tried doing a gsub on the text, but that seems wrong for this. Also, I do not have control over the encoding of the HTML document.
First you should realize that this string is NOT ISO-8859-1 encoded (file says "Non-ISO extended-ASCII text" and the code page confirms this). This may well be your problem; in that case you should specify the right encoding (probably something like Windows-1252 here) in your HTML document.
In Nokogiri, you can also set the encoding explicitly in cases where the document specifies the wrong encoding:
Nokogiri.HTML("<p>\x95\x95 JOHNNY VENETTI \x95\x95</p>", nil, "Windows-1252")
# => #<Nokogiri::HTML::Document: ...
# children=[#<Nokogiri::XML::Text:0x15744cc "•• JOHNNY VENETTI ••">]>]>]>]>
If you don't have the option to solve this cleanly as above, you can also do it the hard way and associate the string with its correct encoding:
s = "\x95\x95 JOHNNY VENETTI \x95\x95"
s.encoding # => #<Encoding:ASCII-8BIT>
s.force_encoding 'Windows-1252'
s.encode! 'utf-8'
s # => "•• JOHNNY VENETTI ••"
Note that this last piece of code is Ruby 1.9 only. If you want, you can read more about the new encoding system in Ruby 1.9.

Convert non-ASCII chars from ASCII-8BIT to UTF-8

I'm pulling text from remote sites and trying to load it into a Ruby 1.9/Rails 3 app that uses utf-8 by default.
Here is an example of some offending text:
Cancer Res; 71(3); 1-11. ©2011 AACR.\n
That Copyright code expanded looks like this:
Cancer Res; 71(3); 1-11. \xC2\xA92011 AACR.\n
Ruby tells me that string is encoded as ASCII-8BIT and feeding into my Rails app gets me this:
incompatible character encodings: ASCII-8BIT and UTF-8
I can strip the copyright code out using this regex
str.gsub(/[^\x00-\x7F]/n,'?')
to produce this
Cancer Res; 71(3); 1-11. ??2011 AACR.\n
But how can I get a copyright symbol (and various other symbols such as greek letters) converted into the same symbols in UTF-8? Surely it is possible...
I see references to using force_encoding but this does not work:
str.force_encoding('utf-8').encode
I realize there are many other people with similar issues but I've yet to see a solution that works.
This works for me:
#encoding: ASCII-8BIT
str = "\xC2\xA92011 AACR"
p str, str.encoding
#=> "\xC2\xA92011 AACR"
#=> #<Encoding:ASCII-8BIT>
str.force_encoding('UTF-8')
p str, str.encoding
#=> "©2011 AACR"
#=> #<Encoding:UTF-8>
There are two possibilities:
The input data is already UTF-8, but Ruby just doesn't know it. That seems to be your case, as "\xC2\xA9" is valid UTF-8 for the copyright symbol. In which case you just need to tell Ruby that the data is already UTF-8 using force_encoding.
For example "\xC2\xA9".force_encoding('ASCII-8BIT') would recreate the relevant bit of your input data. And "\xC2\xA9".force_encoding('ASCII-8BIT').force_encoding('UTF-8') would demonstrate that you can tell Ruby that it is really UTF-8 and get the desired result.
The input data is in some other encoding and you need Ruby to transcode it to UTF-8. In that case you'd have to tell Ruby what the current encoding is (ASCII-8BIT is ruby-speak for binary, it isn't a real encoding), then tell Ruby to transcode it.
For example, say your input data was ISO-8859-1. In that encoding the copyright symbol is just "\xA9". This would generate such a bit of data: "\xA9".force_encoding('ISO-8859-1') And this would demonstrate that you can get Ruby to transcode that to UTF-8: "\xA9".force_encoding('ISO-8859-1').encode('UTF-8')
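Put together as a runnable snippet (the literals simply recreate the two situations described above):
# Bytes that are already UTF-8 but labelled as binary: relabel them.
"\xC2\xA9".force_encoding('ASCII-8BIT').force_encoding('UTF-8')  # => "©"
# Bytes in another encoding (here ISO-8859-1): transcode them.
"\xA9".force_encoding('ISO-8859-1').encode('UTF-8')              # => "©"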
I used to do this for a script that scraped Greek Windows-encoded pages, using open-uri, iconv and Hpricot:
doc = open(DATA_URL)
doc.rewind
data = Hpricot(Iconv.conv('utf-8', "WINDOWS-1253", doc.readlines.join("\n")))
I believe that was Ruby 1.8.7; I'm not sure how things are with Ruby 1.9.
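On Ruby 1.9+ the same transcoding can probably be done without Iconv, using String#encode; a hedged sketch along the same lines (DATA_URL as above, and assuming the page really is Windows-1253):
require 'open-uri'
html = open(DATA_URL).read.encode('UTF-8', 'Windows-1253')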
I've been having issues with character encoding, and the other answers have been helpful, but they didn't work for every case. Here's the solution I came up with: it forces the encoding when possible and transcodes, replacing invalid characters with '?', when that's not possible:
def encode str
  encoded = str.force_encoding('UTF-8')
  unless encoded.valid_encoding?
    encoded = str.encode("utf-8", invalid: :replace, undef: :replace, replace: '?')
  end
  encoded
end
force_encoding works most of the time, but I've encountered some strings where that fails. Strings like this will have invalid characters replaced:
str = "don't panic: \xD3"
str.valid_encoding?
false
str = str.encode("utf-8", invalid: :replace, undef: :replace, replace: '?')
"don't panic: ?"
str.valid_encoding?
true
Update: I have had some issues in production with the above code. I recommend that you set up unit tests with known problem text to make sure that this code works for you like you need it to. Once I come up with version 2 I'll update this answer.
