Ruby URI.extract returns empty array or ArgumentError: invalid byte sequence in UTF-8 - ruby

I'm trying to get a list of files from a URL like this:
require 'uri'
require 'open-uri'
url = 'http://www.wmprof.com/media/niti/download'
html = open(url).read
puts URI.extract(html).select{ |link| link[/(PL)/]}
This code returns ArgumentError: invalid byte sequence in UTF-8 in line with URI.extract (even though html.encoding returns utf-8)
I've found some solutions to encoding problems, but when I'm changing the code to
html.encode('UTF-8', invalid: :replace, undef: :replace, replace: '?')
URI.extract returns an empty array, even when I'm not calling the select method on it. Any suggestions?

The character encoding of the website might be ISO-8859-1 or a related one. We can't tell for sure since there are only two occurrences of the same non-US-ASCII-character and it doesn't really matter anyway.
html.each_char.reject(&:ascii_only?) # => ["\xDC", "\xDC"]
Finding the actual encoding is done by guessing. The age of the markup (HTML 3.2) or the language(s) used might be a clue. In this case the content of the PDF file is especially helpful (it contains SPRÜH-EX and the file has the name TI_DE_SPR%dcH_EX.pdf). Then we only need to find the encodings for which "\xDC" and "Ü" are equal, either by knowing them or by writing some Ruby:
Encoding.list.select { |e| "Ü" == "\xDC".encode!(Encoding::UTF_8, e) rescue next }.map(&:name)
Of course, letting a program do the guessing is an option too. There is the libguess library. The web browser can do it too. However, you need to download the file first, because the server might tell the browser it's UTF-8 even if it isn't (as in this case). Any decent text editor will also try to detect the file encoding: e.g. ST3 thinks it's Windows-1252, which is a superset of ISO-8859-1 (like UTF-8 is of US-ASCII).
Possible solutions are manually setting the string encoding to ISO-8859-1:
html.force_encoding(Encoding::ISO_8859_1)
Or (preferably) transcoding the string from ISO-8859-1 to UTF-8:
html.encode!(Encoding::UTF_8, Encoding::ISO_8859_1)
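As a minimal sketch of the difference between the two options (the sample bytes are made up from the file name mentioned above, not fetched from the site):

```ruby
# Hypothetical sample: the bytes of "SPRÜH-EX" as ISO-8859-1 would store them.
raw = "SPR\xDCH-EX".b                      # binary (ASCII-8BIT) copy

# What you get when the server claims UTF-8: same bytes, wrong label.
mislabeled = raw.dup.force_encoding(Encoding::UTF_8)
mislabeled.valid_encoding?                 # false, a lone \xDC is not valid UTF-8

# Option 1: relabel only. The bytes stay the same, the label becomes ISO-8859-1.
relabeled = raw.dup.force_encoding(Encoding::ISO_8859_1)

# Option 2: transcode. The bytes change and the result is real UTF-8.
transcoded = raw.encode(Encoding::UTF_8, Encoding::ISO_8859_1)
```

After option 2 the string can be mixed freely with other UTF-8 strings, which is why transcoding is usually the more convenient choice.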
To answer the other question: URI.extract isn't the method you're looking for. Apparently it's obsolete and, more importantly, it doesn't extract relative URIs.
A simple alternative is using a regular expression with String#scan. It works with this site but it might not with others. For the best reliability you have to use an HTML parser (there might also be a gem for that). Here's an example that should do what you want:
html.scan(/href="(.*?PL.*?)"/).flatten # => ["SI_PL_ACTIV_bicompact.pdf", ...]
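If absolute URLs are needed, the relative matches can be resolved with URI.join from the standard library. This is only a sketch: the HTML fragment and the second file name are hypothetical, and the base URL with its trailing slash is an assumption about how the site resolves its links.

```ruby
require 'uri'

# Hypothetical page fragment; only the first file name comes from the output above.
html = '<a href="SI_PL_ACTIV_bicompact.pdf">PL</a> <a href="TI_DE_OTHER.pdf">DE</a>'

# Assumption: the links are relative to this directory (note the trailing slash).
base = 'http://www.wmprof.com/media/niti/download/'

# [^"]* keeps each match inside a single attribute value.
relative = html.scan(/href="([^"]*PL[^"]*)"/).flatten
absolute = relative.map { |href| URI.join(base, href).to_s }
```

The character class is a slightly stricter variant of the .*? pattern above; it cannot run past the closing quote of the attribute.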

Related

Ruby character encoding issue with scraped HTML

I'm having a character encoding issue with a Ruby script that does some HTML scraping and parsing with the Nokogiri gem. At one point in the script, I call join("\n") on an array of strings that have been pulled from some HTML, which causes this error:
./script.rb:333:in `join': incompatible character encodings: UTF-8 and ASCII-8BIT (Encoding::CompatibilityError)
In my logs, I can see CafÃ© showing up for some of the strings that would be included in the join operation.
Is it that some of the strings in my array to be joined are ASCII-8BIT and some are UTF-8 and Ruby can't combine them? Do I need to convert or sanitize my strings (into UTF-8) after parsing them with Nokogiri?
I tried force_encoding('UTF-8') and encode('UTF-8') on the scraped HTML content before I do anything else with it, but it didn't help. In fact, after I tried encode('UTF-8'), my script crashed even earlier when it called to_s on a string containing CafÃ©.
Character encoding always really confuses me. Is there something else I can do to sanitize the strings to avoid this error?
Edit:
I was doing something similar in Perl recently and used a module called Text::Unidecode and was able to pass my strings to a function that translates any problematic characters, e.g. the letter a with an acute to the plain letter a. Is there anything similar for Ruby? (This isn't necessarily what I'm aiming for though; if I can keep the a with acute then that's preferable, I think.)
Edit2:
I'm really confused by this and it's proving difficult to reproduce reliably. Here's some code:
[CODE REMOVED]
Edit3:
I removed the previously posted code example because it wasn't correct. But the bottom line is, whenever I try to print or call to_s on the string that was scraped, I get the encoding error.
Edit4:
It turned out in the end that the scraped html input was not what was causing the problem. I got the encoding error whenever I tried to print or call to_s on a hash containing, among other things, the scraped html text. The 'other things' were values from database queries, and they were being returned in ASCII-8BIT. To fix the issue, I explicitly had to call force_encoding('UTF-8') on each database value that I use (although I hear that the mysql2 gem does this automatically so I should switch to that).
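A minimal sketch of that situation, assuming the database driver hands back correct UTF-8 bytes that are merely tagged as ASCII-8BIT (the strings here are stand-ins, not real query results):

```ruby
scraped = "Caf\u00E9"        # "Café" as UTF-8, as a parser would return it
from_db = "Caf\u00E9".b      # the same bytes, but tagged ASCII-8BIT

# Joining the two raises, because both contain non-ASCII bytes
# and their encodings differ.
error = nil
begin
  [scraped, from_db].join("\n")
rescue Encoding::CompatibilityError => e
  error = e
end

# Relabeling the database value is enough when the bytes really are UTF-8.
relabeled = from_db.dup.force_encoding('UTF-8')
joined = [scraped, relabeled].join("\n")
```

force_encoding only changes the label, so it is the right tool only when the bytes are already correct; otherwise the string needs to be transcoded with encode.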
I hate character encoding.
Presumably, CafÃ© is supposed to be Café. If we start out with Café in UTF-8 but treat the bytes as though they were encoded in ISO-8859-1 (AKA Latin-1) and then re-encode them as UTF-8, we get the CafÃ© that you're seeing; for example:
> s = 'Café'
=> "Café"
> s.encoding
=> #<Encoding:UTF-8>
> s.force_encoding('iso-8859-1').encode('utf-8')
=> "CafÃ©"
So somewhere you're reading a UTF-8 string but treating it as Latin-1 and re-encoding it as UTF-8. I'd guess that Nokogiri is reading the page and thinking that it is Latin-1 or being told by your user agent that it is getting Latin-1 text. Perhaps you have a bad default encoding somewhere, or the HTTP headers are lying about the encoding, or the page itself is lying about its encoding.
You need to get everything into UTF-8 at the edges of your scraper. Figure out who is lying about the encoding and sort it out right there.
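If the double encoding cannot be fixed at the source, it can usually be reversed by running the same two steps backwards. This is a sketch, not part of the original advice, and it assumes every character in the damaged string is representable in ISO-8859-1 (otherwise encode raises):

```ruby
original = "Caf\u00E9"                                   # "Café" in UTF-8

# Reproduce the damage: read UTF-8 bytes as Latin-1, then transcode to UTF-8.
mojibake = original.dup.force_encoding('ISO-8859-1').encode('UTF-8')

# Undo it: transcode back to Latin-1 bytes, then relabel them as UTF-8.
repaired = mojibake.encode('ISO-8859-1').force_encoding('UTF-8')
```

Here mojibake is the four-letter word followed by Ã and ©, and repaired is equal to original again.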
Don't feel bad, scraping and encoding is a nightmare of confusion, stupidity, guesswork, and hard liquor. Servers lie, pages lie, browsers lie, no one is happy.

"invalid byte sequence in UTF-8" in rspec controller response

We encounter this error on some of our newer virtual machines, while other machines remain unaffected. We wonder why, and how to get rid of it.
The two main differences are as follows:
vm_old:
debian squeeze
ruby1.9.2p0
vm_new:
debian wheezy
ruby1.9.2p320 (over rvm)
There are naturally more differences between the VMs, but I don't know which would affect this behavior.
We have a response containing umlauts in some of our controllers (e.g. '{"message": "ü"}'), and we have set # encoding: utf-8
Within the spec we test the response against a fixed string with this umlaut
it 'should test something' do
  get :some_controller, format: :json
  response.status.should == 200
  json = ActiveSupport::JSON.decode(response.body)
  json["message"].should == 'ü' # breaks on this line
  # ... some more tests
end
The substitute for ü seems to be a random 4 digit string.
On occasion this string seems to be valid UTF-8 and can be transferred.
We then have a failed spec instead of the error message in the title, since the random string is not the same as ü.
The spec file itself also has the # encoding: utf-8 on the first line.
We tried playing with the locale or with force_encoding('utf-8')
The question now becomes:
Has someone else encountered a problem like this?
and
How to solve it?
Edit: turns out it is not always starting with P\.
Edit 2:
Testing around showed it is a problem with the json decode.
The controller response is something like "{\"foo\": \"\u00fc\"}", decoding that results in random output where the sequence \u00fc used to be.
For simple reproduction:
bundle exec rails c
> ActiveSupport::JSON.decode(ActiveSupport::JSON.encode({:foo => "ü"}))
rails version is 3.0.4
Edit 3:
Changing the JSON backend to Yaml seems to be a valid workaround.
I'm not certain if this will be of help to you, but I figured I'd toss it out there. For me, adding this code:
.encode('UTF-16le', :invalid => :replace, :replace => '').encode('UTF-8')
totally saved me. Essentially, it involves converting your UTF-8 encoding to UTF-16, and then encoding it back to UTF-8. More information is available here.
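A small sketch of what that round trip does to a string with a stray byte (the sample input is made up). As far as I know, the detour through UTF-16 was needed because on Rubies before 2.1 encoding a UTF-8 string to UTF-8 was a no-op, so the :invalid option never got a chance to run.

```ruby
# Hypothetical input: valid text with one stray byte that is not valid UTF-8.
dirty = "abc\xFFdef".b.force_encoding('UTF-8')
dirty.valid_encoding?     # false

# Going through UTF-16LE forces a real conversion, which drops the bad byte.
clean = dirty.encode('UTF-16le', :invalid => :replace, :replace => '').encode('UTF-8')
```

Note that the invalid bytes are silently discarded, so this is lossy by design.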

gsub :: ArgumentError (invalid byte sequence in UTF-8)

This code uses the Hpricot gem to get HTML that contains UTF-8 characters.
# <div>This is a test测试</div>
div[0].to_html.gsub(/test/, "")
When that is run, it spits out this error (pointing at gsub):
ArgumentError (invalid byte sequence in UTF-8)
How can we fix this issue?
Figured out the issue. Hpricot's to_html calls methods that trigger the error so to get rid of that we need to make the Hpricot document encoding UTF-8, not just that one string. We do that like this:
ic = Iconv.new("UTF-8//IGNORE", "UTF-8")
doc = open("http://example.com") {|f| Hpricot(ic.iconv(f.read)) }
And then we can call other Hpricot methods but now the whole document has UTF-8 encoding and it won't give us any errors.
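Iconv is no longer part of Ruby's standard library (it was removed in 2.0). On Ruby 2.1 or later, String#scrub does the same kind of cleanup. The following is only a sketch with a made-up input string, and it swaps in scrub for the Iconv call above rather than reproducing it:

```ruby
# Hypothetical input: a UTF-8 document with one stray, invalid byte (\xFF).
raw = "<div>This is a test\xE6\xB5\x8B\xFF</div>".b.force_encoding('UTF-8')
raw.valid_encoding?               # false

# Drop every invalid byte sequence; valid multibyte characters are kept.
cleaned = raw.scrub('')

# Regex-based methods such as gsub now work on the cleaned string.
without_test = cleaned.gsub(/test/, '')
```

The cleaned string, not the raw one, is what should be handed to the HTML parser.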
to_html appears to return a non-UTF-8 string in this case.
I had the same problem with a file containing some non-UTF-8 characters. The fix I found is not really beautiful, but it could also work for your case:
the_utf8_string = the_non_utf8_string.unpack('C*').pack('U*')
Be careful: I'm not sure that no data is lost.
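To make the caveat concrete: unpack('C*').pack('U*') turns every byte into the code point with the same number, which amounts to reading the input as ISO-8859-1. A sketch with made-up strings:

```ruby
# Latin-1 input: the trick gives the same result as a proper transcode.
latin1_bytes = "caf\xE9".b
via_pack     = latin1_bytes.unpack('C*').pack('U*')
via_encode   = latin1_bytes.encode('UTF-8', 'ISO-8859-1')

# Input that is already UTF-8: each byte of "é" becomes its own character.
already_utf8 = "caf\u00E9"
mangled      = already_utf8.unpack('C*').pack('U*')
```

So nothing is dropped, but any multibyte UTF-8 character that was already correct comes out as two or more wrong characters.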

Encoding Unicode code points with Ruby

I'm retrieving an HTML document that is parsed with Nokogiri. The HTML is using charset ISO-8859-1. The problem is there are some Unicode chars in the document which are converted to Unicode code points instead of their respective character.
For example, this is some text in the HTML as received (in ISO-8859-1):
\x95\x95 JOHNNY VENETTI \x95\x95
And when attempting to work with this text, it gets converted to this:
\u0095\u0095 JOHNNY VENETTI \u0095\u0095
So my question is, how can I ensure those characters are represented as their appropriate character instead of the code point? I've tried doing a gsub on the text, but that seems wrong for this. Also, I do not have control over the encoding of the HTML document.
First, you should realize that this string is NOT ISO-8859-1 encoded (file says "Non-ISO extended-ASCII text" and the code page confirms this). This may well be your problem; in that case you should specify the right encoding (probably something like Windows-1252) in your HTML document.
In Nokogiri, you can also set the encoding explicitly in cases where the document specifies the wrong encoding:
Nokogiri.HTML("<p>\x95\x95 JOHNNY VENETTI \x95\x95</p>", nil, "Windows-1252")
# => #<Nokogiri::HTML::Document: ...
# children=[#<Nokogiri::XML::Text:0x15744cc "•• JOHNNY VENETTI ••">]>]>]>]>
If you don't have the option to solve this cleanly like above, you can also do it the hard way and associate the string with its correct encoding:
s = "\x95\x95 JOHNNY VENETTI \x95\x95"
s.encoding # => #<Encoding:ASCII-8BIT>
s.force_encoding 'Windows-1252'
s.encode! 'utf-8'
s # => "•• JOHNNY VENETTI ••"
Note that this last piece of code is Ruby 1.9 only. If you want, you can read more about the new encoding system in Ruby 1.9.
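The difference between the two encodings for byte 0x95 can be checked directly. A minimal sketch (using String#b so the literal is binary regardless of the source file's encoding):

```ruby
raw = "\x95\x95 JOHNNY VENETTI \x95\x95".b

# Read as ISO-8859-1, 0x95 maps to U+0095, a C1 control character.
as_latin1 = raw.encode('UTF-8', 'ISO-8859-1')

# Read as Windows-1252, 0x95 maps to U+2022, the bullet.
as_cp1252 = raw.encode('UTF-8', 'Windows-1252')
```

This is why the text shows up as \u0095 when the document's declared ISO-8859-1 charset is taken at face value.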

Incompatible encodings with ruby and Nokogiri HTML

I'm parsing an external HTML page with Nokogiri. That page is encoded with ISO-8859-1. Part of the data I want to extract, contains some – (dash) html entities:
xml = Nokogiri.HTML(open("http://flybynight.com.br/agenda.php"), nil, 'ISO-8859-1')
f = xml.xpath("//div[@style='background-color:#D9DBD9; padding:15px 12px 10px 10px;']//div[@class='tit_inter_cnz']/text()")
f[0].text #=> Preview M/E/C/A \u0096 John Digweed
In the last line, the String should be rendered in the browser with a dash. The browser renders it correctly if I specify my page as ISO-8859-1 encoded; however, my Sinatra app uses UTF-8. How can I correctly display that text in the browser? Today it is being displayed as a square with a small number inside.
I tried force_encoding('ISO-8859-1'), but then I get a CompatibilityError from Sinatra.
Any clues?
[Edit]
Below are screenshots of the app:
-> Firefox with character encoding UTF-8
-> Firefox with character encoding Western (ISO-8859-1)
It's worth mentioning that in the ISO-8859-1 mode above, the dash is shown correctly, but there is another incorrect character with it just before the dash. Weird :(
After parsing a document in Nokogiri you can tell it to assume a different encoding. Try:
require 'open-uri'
require 'nokogiri'
doc = Nokogiri::HTML(open("http://flybynight.com.br/agenda.php"), nil, 'ISO-8859-1')
doc.encoding = 'UTF-8'
I can't see that page from here, to confirm this fixes the problem, but it's worked for similar problems.
Summary: The problematic characters are control characters from ISO-8859-1, not intended for display.
Details and Investigation:
Here's a test showing that you are getting valid UTF-8 from Nokogiri and Sinatra:
require 'sinatra'
require 'open-uri'
get '/' do
  html = open("http://flybynight.com.br/agenda.php").read
  p [ html.encoding, html.valid_encoding? ]
  #=> [#<Encoding:ISO-8859-1>, true]
  str = html[ /Preview.+?John Digweed/ ]
  p [ str, str.encoding, str.valid_encoding? ]
  #=> ["Preview M/E/C/A \x96 John Digweed", #<Encoding:ISO-8859-1>, true]
  utf8 = str.encode('UTF-8')
  p [ utf8, utf8.encoding, utf8.valid_encoding? ]
  #=> ["Preview M/E/C/A \xC2\x96 John Digweed", #<Encoding:UTF-8>, true]
  require 'nokogiri'
  doc = Nokogiri.HTML(html, nil, 'ISO-8859-1')
  p doc.encoding
  #=> "ISO-8859-1"
  dig = doc.xpath("//div[@class='tit_inter_cnz']")[1]
  p [ dig.text, dig.text.encoding, dig.text.valid_encoding? ]
  #=> ["Preview M/E/C/A \xC2\x96 John Digweed", #<Encoding:UTF-8>, true]
  <<-ENDHTML
    <!DOCTYPE html>
    <html><head><title>Dig it!</title></head><body>
    <p>Here it comes...</p>
    <p>#{dig.text}</p>
    </body></html>
  ENDHTML
end
This properly serves up content with Content-Type:text/html;charset=utf-8 on my computer. Chrome does not show me this character in the browser, however.
Analyzing that response, the same UTF-8 byte pair comes back for the dash as seen above: \xC2\x96. This appears to be Unicode character U+0096, which seems to be an odd dash.
I would chalk this up to bad source data, and simply throw:
#encoding: UTF-8
at the top of your Ruby source file(s), and then put in:
f = ...text.gsub( "\xC2\x96", "-" ) # Or a better Unicode character
Edit: If you look at the browser test page for that character you will see (at least in Chrome and Firefox for me) that the UTF-8 literal version is blank, but the hex and decimal escape versions show up. I cannot fathom why this is, but there you have it. The browsers are simply not displaying your character correctly when presented in raw form.
Either make it an HTML entity, or a different Unicode dash. Either way a gsub is called for.
Edit #2: One more odd note: the character in the source encoding has a hexadecimal byte value of 0x96. As far as I can tell, this does not appear to be a printable ISO-8859-1 character. As shown in the official spec for ISO-8859-1, this falls in one of the two non-printing regions.
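A sketch of two ways to end up with a displayable dash. The first is the gsub suggested above, with an en dash as the replacement. The second rests on an assumption that is not confirmed in this thread, namely that the page is really Windows-1252, where byte 0x96 is the en dash:

```ruby
# What Nokogiri returned in the question: U+0096, a C1 control character.
text = "Preview M/E/C/A \u0096 John Digweed"

# Repair 1: swap the control character for an en dash after the fact.
swapped = text.gsub("\u0096", "\u2013")

# Repair 2 (assumption: the source is Windows-1252): decode the raw bytes
# with that encoding, where 0x96 maps to the en dash directly.
raw = "Preview M/E/C/A \x96 John Digweed".b
decoded = raw.encode('UTF-8', 'Windows-1252')
```

Both give the same string; the second avoids a per-character cleanup if other bytes in the 0x80-0x9F range appear on the page.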
I work in publishing of scientific manuscripts and there are many dashes. The dash that you are using is not an ASCII dash, it is a Unicode dash. Forcing the ISO encoding is probably what changes the dash.
http://www.fileformat.info/info/unicode/char/96/index.htm
That site is excellent for Unicode issues.
The reason you are getting a square is perhaps that your browser does not support this character. It is probably correctly encoded. I would keep UTF-8 encoding, and if you want to make that dash so everyone can see it, convert it to an ASCII dash.
You may want to try Iconv to convert the characters to ASCII/UTF-8 http://craigjolicoeur.com/blog/ruby-iconv-to-the-rescue