Ruby character encoding issue with scraped HTML - ruby

I'm having a character encoding issue with a Ruby script that does some HTML scraping and parsing with the Nokogiri gem. At one point in the script, I call join("\n") on an array of strings that have been pulled from some HTML, which causes this error:
./script.rb:333:in `join': incompatible character encodings: UTF-8 and ASCII-8BIT (Encoding::CompatibilityError)
In my logs, I can see Café showing up for some of the strings that would be included in the join operation.
Is it that some of the strings in my array to be joined are ASCII-8BIT and some are UTF-8 and ruby can't combine them? Do I need to convert or sanitize my strings after parsing them with Nokogiri (into UTF-8)?.
I tried force_encoding('UTF-8') and encode('UTF-8') on the scraped HTML content before I do anything else with it, but it didn't help. In fact, after I tried encode('UTF-8'), my script crashed even earlier when it called to_s on a string containing Café.
Character encoding always really confuses me. Is there something else I can do to sanitize the strings to avoid this error?
Edit:
I was doing something similar in Perl recently and used a module called Text::Unidecode and was able to pass my strings to a function that translates any problematic characters e.g. the letter a with an acute to the plain letter a. Is there anything similar for ruby? (This isn't necessarily what I'm aiming for though, if I can keep the a with acute then that's preferable I think.
Edit2:
I'm really confused by this and it's proving difficult to reproduce reliably. Here's some code:
[CODE REMOVED]
Edit3:
I removed the previously posted code example because it wasn't correct. But the bottom line is, whenever I try to print or call to_s on the string that was scraped, I get the encoding error.
Edit4:
It turned out in the end that the scraped html input was not what was causing the problem. I got the encoding error whenever I tried to print or call to_s on a hash containing, among other things, the scraped html text. The 'other things' were values from database queries, and they were being returned in ASCII-8BIT. To fix the issue, I explicitly had to call force_encoding('UTF-8') on each database value that I use (although I hear that the mysql2 gem does this automatically so I should switch to that).
I hate character encoding.

Presumably, Café is supposed to be Café. If we start out with Café in UTF-8 but treat the bytes as though they were encoded in ISO-8859-1 (AKA Latin-1) and then re-encode them as UTF-8, we get the Café that you're seeing; for example:
> s = 'Café'
=> "Café"
> s.encoding
=> #<Encoding:UTF-8>
> s.force_encoding('iso-8859-1').encode('utf-8')
=> "Café"
So somewhere you're reading a UTF-8 string but treating it as Latin-1 and re-encoding it as UTF-8. I'd guess that Nokogiri is reading the page and thinking that it is Latin-1 or being told by your user agent that it is getting Latin-1 text. Perhaps you have a bad default encoding somewhere, or the HTTP headers are lying about the encoding, or the page itself is lying about its encoding.
You need to get everything into UTF-8 at the edges of your scraper. Figure out who is lying about the encoding and sort it out right there.
Don't feel bad, scraping and encoding is a nightmare of confusion, stupidity, guesswork, and hard liquor. Servers lie, pages lie, browsers lie, no one is happy.

Related

How to use internal/external encoding when importing a YAML file?

How can I load a YAML file regardlessly of its encoding?
My YAML file can be encoded in UTF-8 or ANSI (that's what Notepad++ calls it - I guess it's Windows-1252):
:key1:
:key2: "ä"
utf8.yml is encoded in UTF-8, ansi.yml is encoded in ANSI. I load the files as follows:
# encoding: utf-8
Encoding.default_internal = "utf-8"
utf8_load = YAML::load(File.open('utf8.yml'))
utf8_load_file = YAML::load_file('utf8.yml')
ansi_load = YAML::load(File.open('ansi.yml'))
ansi_load_file = YAML::load_file('ansi.yml')
It seems like Ruby doesn't recognize the encoding correctly:
utf8_load [:key1][:key2].encoding #=> "UTF-8"
utf8_load_file [:key1][:key2].encoding #=> "UTF-8"
ansi_load [:key1][:key2].encoding #=> "UTF-8"
ansi_load_file [:key1][:key2].encoding #=> "UTF-8"
because the bytes aren't the same:
utf8_load [:key1][:key2].bytes #=> [195, 164]
utf8_load_file [:key1][:key2].bytes #=> [195, 164]
ansi_load [:key1][:key2].bytes #=> [239, 191, 189]
ansi_load_file [:key1][:key2].bytes #=> [239, 191, 189]
If I miss Encoding.default_internal = "utf-8", the bytes are also different:
utf8_load [:key1][:key2].bytes #=> [195, 131, 194, 164]
utf8_load_file [:key1][:key2].bytes #=> [195, 164]
ansi_load [:key1][:key2].bytes #=> [195, 164]
ansi_load_file [:key1][:key2].bytes #=> [239, 191, 189]
What happens actually when I don't set the default_internal to utf-8?
Which encodings do the strings in both examples have?
How can I load a file even if I don't know its encoding?
I believe officially YAML only supports UTF-8 (and maybe UTF-16). There have historically been all sorts of encoding confusions in YAML libraries. I think you are going to run into trouble trying to have YAML in something other than a Unicode encoding.
What happens actually when I don't set the default_internal to utf-8?
Encoding.default_internal controls the encoding your input will be converted to when it is read in, at least by some operations that respect Encoding.default_internal, not everything does. Rails seems to set it to UTF-8. So if you don't set the Encoding.default_internal to UTF-8, it might be UTF-8 already anyway.
If Encoding.default_internal is nil, then those operations that respect it, and try to convert any input to Encoding.default_internal upon reading it in won't do that, they'll leave any input in the encoding it was believed to originate in, not try to convert it.
If you set it to something else, like say "WINDOWS-1252" Ruby would automatically convert your stuff to WINDOWS-1252 when it read it in with File.open, which would possibly confuse YAML::load when you pass the string that's now encoded and tagged as WINDOWS-1252 to it. Generally there's no good reason to do this, so leave Encoding.default_internal alone.
Note: The Ruby docs say:
"You should not set ::default_internal in Ruby code as strings created before changing the value may have a different encoding from strings created after the change. Instead you should use ruby -E to invoke Ruby with the correct default_internal."
See also: http://ruby-doc.org/core-1.9.3/Encoding.html#method-c-default_internal
Which encodings do the strings in both examples have?
I don't really know. One would have to have to look at the bytes and try to figure out if they are legal bytes for various plausible encodings, and beyond being legal, if they mean something likely to be intended.
For example take: "ÉGÉìÉRÅ[ÉfÉBÉìÉOÇÕìÔǵÇ≠ǻǢ". That's a perfectly legal UTF-8 string, but as humans we know it's probably not intended, and is probably garbage, quite likely from the result of an encoding misinterpretation. But a computer has no way to know that, it's perfectly legal UTF-8, and, hey, maybe someone actually did mean to write "ÉGÉìÉRÅ[ÉfÉBÉìÉOÇÕìÔǵÇ≠ǻǢ", after all, I just did, when writing this post!
So you can try to interpret the bytes according to various encodings and see if any of them make sense.
You're really just guessing at this point. Which means...
How can I load a file even if I don't know it's encoding?
Generally, you can not. You need to know and keep track of encodings. There's no real way to know what the bytes mean without knowing their encoding.
If you have some legacy data for which you've lost this, you've got to try to figure it out. Manually, or with some code that tries to guess likely encodings based on heuristics. Here's one Ruby gem Charlock Holmes that tries to guess, using the ICU library heuristics (this particular gem only works on MRI).
What Ruby says in response to string.encoding is just the encoding the string is tagged with. The string can be tagged with the wrong encoding, the bytes in the string don't actually mean what is intended in the encoding it's tagged with... in which case you'll get garbage.
Ruby will do the right things with your string instead of creating garbage only if the string's encoding tag is correct. The string's encoding tag is determined by Encoding.default_external for most input operations by default (Encoding.default_external usually starts out as UTF-8, or ASCII-8BIT which really means the null encoding, binary data, not tagged with an encoding), or by passing an argument to File.open: File.open("something", "r:UTF-8" or, means the same thing, File.open("something", "r", :encoding => "UTF-8"). The actual bytes are determined by whatever is in the file. It's up to you to tell Ruby the correct encoding to interpret those bytes as text meaning what they were intended to mean.
There were a couple posts recently to reddit /r/ruby that try to explain how to troubleshoot and workaround encoding issues that you may find helpful:
http://www.justinweiss.com/articles/how-to-get-from-theyre-to-theyre/
http://www.justinweiss.com/articles/3-steps-to-fix-encoding-problems-in-ruby/
Also, this is my favorite article on understanding encoding generally: http://kunststube.net/encoding/
For YAML files in particular, if I were you, I'd just make sure they are all in UTF-8. Life will be much easier and you won't have to worry about it. If you have some legacy ones that have become corrupted, it's going to be a pain to fix them, but that's what you've got to do, unless you can just rewrite them from scratch. Try to fix them to be in valid and correct UTF-8, and from here on out keep all your YAML in UTF-8.
The YAML specification says in "5.1. Character Set":
To ensure readability, YAML streams use only the printable subset of the Unicode character set. The allowed character range explicitly excludes the C0 control block #x0-#x1F (except for TAB #x9, LF #xA, and CR #xD which are allowed), DEL #x7F, the C1 control block #x80-#x9F (except for NEL #x85 which is allowed), the surrogate block #xD800-#xDFFF, #xFFFE, and #xFFFF.
This means that Windows-1252 or ISO-8859-1 encoding are acceptable as long as the characters being output are within the defined range. Windows users tend to use the "the C1 control block #x80-#x9F" range for diacritical and accented characters, so if those are present in a YAML file the file is not going to meet the spec and the YAML generator didn't do its job correctly. And that explains why "ä" isn't acceptable.
On output, a YAML processor must only produce acceptable characters. Any excluded characters must be presented using escape sequences. In addition, any allowed characters known to be non-printable should also be escaped. This isn’t mandatory since a full implementation would require extensive character property tables.
These days, by default, Ruby uses UTF-8, however YAML isn't limited to that. The spec goes on to say in "5.2. Character Encodings":
On input, a YAML processor must support the UTF-8 and UTF-16 character encodings. For JSON compatibility, the UTF-32 encodings must also be supported.
If a character stream begins with a byte order mark, the character encoding will be taken to be as as indicated by the byte order mark. Otherwise, the stream must begin with an ASCII character. This allows the encoding to be deduced by the pattern of null (#x00) characters.
So, UTF-8, 16 and 32 are supported, but Ruby will assume UTF-8. If the BOM is present you'll see it when you view the file in an editor. I haven't tried loading a UTF-16 or 32 file to see what Ruby's YAML does, so that's left as an experiment.

Ruby URI.extract returns empty array or ArgumentError: invalid byte sequence in UTF-8

I'm trying to get a list of files from url like this:
require 'uri'
require 'open-uri'
url = 'http://www.wmprof.com/media/niti/download'
html = open(url).read
puts URI.extract(html).select{ |link| link[/(PL)/]}
This code returns ArgumentError: invalid byte sequence in UTF-8 in line with URI.extract (even though html.encoding returns utf-8)
I've found some solutions to encoding problems, but when I'm changing the code to
html.encode('UTF-8', invalid: :replace, undef: :replace, replace: '?')
URI.extract returns empty string, even when I'm not calling the select method on it. Any suggestions?
The character encoding of the website might be ISO-8859-1 or a related one. We can't tell for sure since there are only two occurrences of the same non-US-ASCII-character and it doesn't really matter anyway.
html.each_char.reject(&:ascii_only?) # => ["\xDC", "\xDC"]
Finding the actual encoding is done by guessing. The age of HTML 3.2 or the used language/s might be a clue. And in this case especially the content of the PDF file is helpful (it contains SPRÜH-EX and the file has the name TI_DE_SPR%dcH_EX.pdf). Then we only need to find the encoding for which "\xDC" and "Ü" are equal. Either by knowing it or writing some Ruby:
Encoding.list.select { |e| "Ü" == "\xDC".encode!(Encoding::UTF_8, e) rescue next }.map(&:name)
Of course, letting a program do the guessing is an option too. There is the libguess library. The web browser can do it too. However you you need to download the file though unless the server might tell the browser it's UTF-8 even if it isn't (like in this case). Any decent text editor will also try to detect the file encoding: e.g. ST3 thinks it's Windows 1252 which is a superset of ISO-8859-1 (like UTF-8 is of US-ASCII).
Possible solutions are manually setting the string encoding to ISO-8859-1:
html.force_encoding(Encoding::ISO_8859_1)
Or (preferably) transcoding the string from ISO-8859-1 to UTF-8:
html.encode!(Encoding::UTF_8, Encoding::ISO_8859_1)
To answer the other question: URI.extract isn't the method you're looking for. Apparently it's obsolete and more importantly, it doesn't extract relative URI.
A simple alternative is using a regular expression with String#scan. It works with this site but it might not with other ones. You have to use a HTML parser for the best reliability (there might be also a gem). Here's an example that should do what you want:
html.scan(/href="(.*?PL.*?)"/).flatten # => ["SI_PL_ACTIV_bicompact.pdf", ...]

Convert a unicode string to characters in Ruby?

I have the following string:
l\u0092issue
My question is how to convert it to utf8 characters ?
I have tried that
1.9.3p484 :024 > "l\u0092issue".encode('utf-8')
=> "l\u0092issue"
You seem to have got your encodings into a bit of a mix up. If you haven’t already, you should first read Joel Spolsky’s article The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) which provides a good introduction into this type of thing. There is a good set of articles on how Ruby handles character encodings at http://graysoftinc.com/character-encodings/understanding-m17n-multilingualization. You could also have a look at the Ruby docs for String and Encoding.
In this specific case, the string l\u0092issue means that the second character is the character with the unicode codepoint 0x92. This codepoint is PRIVATE USE TWO (see the chart), which basically means this position isn’t used.
However, looking at the Windows CP-1252 encoding, position 0x92 is occupied by the character ’, so if this is the missing character the the string would be l’issue, whick looks a lot more likely even though I don’t speak French.
What I suspect has happened is your program has received the string l’issue encoded in CP-1252, but has assumed it was encoded in ISO-8859-1 (ISO-8859-1 and CP-1252 are quite closely related) and re-encoded it to UTF-8 leaving you with the string you now have.
The real fix for you is to be careful about the encodings of any strings that enter (and leave) your program, and how you manage them.
To transform your string to l’issue, you can encode it back to ISO-8859-1, then use force_encoding to tell Ruby the real encoding of CP-1252, and then you can re-encode to UTF-8:
2.1.0 :001 > s = "l\u0092issue"
=> "l\u0092issue"
2.1.0 :002 > s = s.encode('iso-8859-1')
=> "l\x92issue"
2.1.0 :003 > s.force_encoding('cp1252')
=> "l\x92issue"
2.1.0 :004 > s.encode('utf-8')
=> "l’issue"
This is only really a demonstration of what is going on though. The real solution is to make sure you’re handling encodings correctly.
That is encoded as UTF-8 (unless you changed the original string encoding). Ruby is just showing you the escape sequences when you inspect the string (which is why IRB does there). \u0092 is the escape sequence for this character.
Try puts "l\u0092issue" to see the rendered character, if your terminal font supports it.

"invalid byte sequence in UTF-8" in rspec controller response

We encounter the said error on some of our newer virtual machines, while other machines remain unaffected and wonder why and furthermore how to get rid of them.
the two main differences are as follows
vm_old:
debian squeeze
ruby1.9.2p0
vm_new:
debian wheezy
ruby1.9.2p320 (over rvm)
There naturally are more changes within the VMs, but i don't know which would affect this behavior.
We have a response containing umlauts within some of our controllers (ie. '{"message": "ü"}') and we have set # encoding: utf-8
Within the spec we test the response against a fixed string with this umlaut
it 'should test something' do
get :some_controller, format: :json
response.status.should == 200
json = ActiveSupport::JSON.decode(response.body)
json["message"].should == 'ü' # breaks on this line
# ... some more tests
end
The substitute for ü seems to be a random 4 digit string.
On occasion this string seems to be valid utf-8 and can be transfered.
We then have a failed spec instead of the error message in the title, since the random string is not the same as ü.
The spec file itself also has the # encoding: utf-8 on the first line.
We tried playing with the locale or with force_encoding('utf-8')
The question now becomes:
Has someone else encountered a problem like this?
and
How to solve it?
Edit: turns out it is not always starting with P\.
Edit 2:
Testing around showed it is a problem with the json decode.
The controller response is something like "{\"foo\": \"\u00fc\"}", decoding that results in random output where the sequence \u00fc used to be.
for simple reproduction:
bundle exec rails c
> ActiveSupport::JSON.decode(ActiveSupport::JSON.encode({:foo => "ü"})
rails version is 3.0.4
Edit 3:
Changing the JSON backend to Yaml seems to be a valid workaround.
I'm not certain if this will be of help to you, but I figured I'd toss it out there. For me, adding this code:
.encode('UTF-16le', :invalid => :replace, :replace => '').encode('UTF-8')
totally saved me. Essentially, it involves converting your UTF-8 encoding to UTF-16, and then encoding it back to UTF-8. More information is available here.

Ruby hexacode to unicode conversion

I crawled a website which contains unicode, an the results look something like, if in code
a = "\\u2665 \\uc624 \\ube60! \\uc8fd \\uae30 \\uc804 \\uc5d0"
May I know how do I do it in Ruby to convert it back to the original Unicode text which is in UTF-8 format?
If you have ruby 1.9, you can try:
a.force_encoding('UTF-8')
Otherwise if you have < 1.9, I'd suggest reading this article on converting to UTF-8 in Ruby 1.8.
short answer: you should be able to 'puts a', and see the string printed out. for me, at least, I can print out that string in both 1.8.7 and 1.9.2
long answer:
First thing: it depends on if you're using ruby 1.8.7, or 1.9.2, since the way strings and encodings were handled changed.
in 1.8.7:
strings are just lists of bytes. when you print them out, if your OS can handle it, you can just 'puts a' and it should work correctly. if you do a[0], you'll get the first byte. if you want to get each character, things are pretty darn tricky.
in 1.9.2
strings are lists of bytes, with an encoding. If the webpage was sent with the correct encoding, your string should already be encoded correctly. if not, you'll have to set it (as per Mike Lewis's answer). if you do a[0], you'll get the first character (the heart). if you want each byte, you can do a.bytes.
If your OS, for whatever reason, is giving you those literal ascii characters,my previous answer is obviously invalid, disregard it. :P
here's what you can do:
a.gsub(/\\u([a-z0-9]+)/){|p| [$1.to_i(16)].pack("U")}
this will scan for the ascii string '\u' followed by a hexadecimal number, and replace it with the correct unicode character.
You can also specify the encoding when you open a new IO object: http://www.ruby-doc.org/core/classes/IO.html#M000889
Compared to Mike's solution, this may prevent troubles if you forget to force the encoding before exposing the string to the rest of your application, if there are multiple mechanisms for retrieving strings from your module or class. However, if you begin crawling SJIS or KOI-8 encoded websites, then Mike's solution will be easier to adapt for the character encoding name returned by the web server in its headers.

Resources